Abstract. The Bayesian measure of sample information about the parameter, known as
Lindley's measure, is widely used in various problems such as developing prior distributions, 
models for the likelihood functions and optimal designs. The predictive information is defined 
similarly and used for model selection and optimal designs, though to a lesser extent. The 
parameter and predictive information measures are proper utility functions and have been 
also used in combination. Yet the relationship between the two measures and the effects 
of conditional dependence between the observable quantities on the Bayesian information 
measures remain unexplored. We address both issues. The relationship between the two in- 
formation measures is explored through the information provided by the sample about the 
parameter and prediction jointly. The role of dependence is explored along with the interplay 
between the information measures, prior and sampling design. For the conditionally independent
sequence of observable quantities, decompositions of the joint information characterize
Lindley's measure as the sample information about the parameter and prediction jointly 
and the predictive information as part of it. For the conditionally dependent case, the joint 
information about parameter and prediction exceeds Lindley's measure by an amount due 
to the dependence. More specific results are shown for the normal linear models and a broad 
subfamily of the exponential family. Conditionally independent samples provide relatively 
little information for prediction, and the gap between the parameter and predictive informa- 
tion measures grows rapidly with the sample size. Three dependence structures are studied: 
the intraclass (IC) and serially correlated (SC) normal models, and order statistics. For IC 
and SC models, the information about the mean parameter decreases and the predictive in- 
formation increases with the correlation, but the joint information is not monotone and has 
a unique minimum. Compensation of the loss of parameter information due to dependence 
requires larger samples. For the order statistics, the joint information exceeds Lindley's mea- 
sure by an amount which does not depend on the prior or the model for the data, but it is 
not monotone in the sample size and has a unique maximum. 

Key words and phrases: Bayesian predictive distribution, entropy, mutual information, op- 
timal design, reference prior, intraclass correlation, serial correlation, order statistics. 






Nader Ebrahimi is Professor, Division of Statistics,
Northern Illinois University, DeKalb, Illinois 60155,
USA (e-mail: nader@math.niu.edu). Ehsan S. Soofi is
Professor of Management Science and Statistics,
Sheldon B. Lubar School of Business, University of
Wisconsin-Milwaukee, PO Box 742, Milwaukee,
Wisconsin 53201, USA (e-mail: esoofi@uwm.edu). Refik
Soyer is Professor of Decision Sciences and Statistics,
Department of Decision Sciences and Department of
Statistics, George Washington University, Washington,
DC 20052, USA (e-mail: soyer@gwu.edu).
Corresponding author.

This is an electronic reprint of the original article 
published by the Institute of Mathematical Statistics in 
Statistical Science, 2010, Vol. 25, No. 3, 348-367. This 
reprint differs from the original in pagination and 
typographic detail. 






N. EBRAHIMI, E. S. SOOFI AND R. SOYER 



1. INTRODUCTION 

The elements of Bayesian information analysis are 
a set of $n$ observations, denoted as an $n \times 1$ vector
$y$ generated from a sequence of random variables
$Y_1, Y_2, \ldots$ with a joint probability model $f(y|\theta)$,
where the parameter $\theta$ has a prior probability distribution
$f(\theta)$, $\theta \in \Theta$, and a new outcome $Y_\nu$. We
follow the convention of using uppercase letters for
unknown quantities, which may be scalar or vec- 
tor. Whereas the concept of prediction is usually an 
afterthought in classical statistics, unless one deals 
with regression or forecasting type models, predic- 
tive inference naturally arises as a consequence of 
calculus of probability and is a standard output of 
Bayesian analysis. Bayesians are interested in prediction
of future outcomes, because eventually they will be
observed and allow bets to be settled in the sense
of de Finetti. Predictive inference is considered
as a distinguishing feature of the Bayesian approach. 
But one cannot develop predictive inference without 
estimation, that is, without obtaining the posterior 
distribution of the parameter. The parameter plays 
the pivotal role in prediction, and a clear perspective
of the information provided by the sample about
the parameter and prediction can be obtained only
through viewing $(\Theta, Y_\nu)$ jointly.

Information provided by the data refers to a mea- 
sure that quantifies changes from a prior to a pos- 
terior distribution of an unknown quantity. Lind- 
ley (1956) framed the problem of measuring sample 
information about the parameter in terms of Shan- 
non's (1948) notion of information in the noisy chan- 
nel (sample) about the signal transmitted from a 
source (parameter). The notion is operationalized 
in terms of entropy and mutual information mea- 
sures. Bernardo (1979a) showed that Lindley's mea- 
sure of information about the parameter is the ex- 
pected value of a logarithmic utility function for the 
decision problem of reporting a probability distribu- 
tion from the space of all distributions. The informa- 
tion utility function belongs to a large class of util- 
ity functions discussed by Good (1971) and others 
which lead to the posterior distribution given by the 
Bayes rule as the optimal distribution. The predic- 
tive version of Lindley's measure, referred to as pre- 
dictive information, quantifies the expected amount 
of information provided by the sample about pre- 
diction of a new outcome. 

A list of articles on Lindley's measure and its
methodological applications is tabulated in the
Appendix. The major areas of applications are classified
in terms of sampling design and developing
models for the likelihood function, and developing 
prior and posterior distributions. Stone (1959) was
the first to apply Lindley's measure to normal regression
experiments, and El-Sayyed (1969) was the first
to apply Lindley's measure to the exponential model.
Following Bernardo (1979a, 1979b), several authors 
have presented evaluation and selection of the like- 
lihood function in terms of Lindley's measure as a 
Bayesian decision problem. Chaloner and Verdinelli 
(1995) provided an extensive review and additional 
references for the experimental design; see also the 
works of Barlow and Hsiung (1983) and Polson (1993).
Soofi (1988, 1990) and Ebrahimi and Soofi (1990) 
examined the trade-offs between the prior and de- 
sign parameters for the information about the model 
parameter. Carota, Parmigiani and Polson (1996)
developed an approximation for application to model 
elaboration. Yuan and Clarke (1999) proposed de- 
veloping the model for the likelihood function that 
maximizes Lindley's measure subject to a constraint 
in terms of the Bayes risk of the model. San Mar- 
tini and Spezzaferri (1984) used a version of the 
predictive information for model selection. Amaral 
and Dunsmore (1985) studied the predictive mea- 
sure and applied it to the exponential parameter. 
Verdinelli, Polson and Singpurwalla (1993) used the
predictive information and Verdinelli (1992) consid- 
ered a linear combination of the parameter and pre- 
dictive information measures as design criteria. 

This article is another testimony of the depth and 
breadth of Lindley's pioneering work on the relation- 
ships between Shannon's information theory and 
Bayesian inference. We explore the relationship be- 
tween the parameter and predictive information mea- 
sures and examine the roles of prior, design and 
the dependence in the sequence $Y_i|\theta$, $i = 1, 2, \ldots$, on
the information measures and their interrelation- 
ship. This expedition integrates and expands the ex- 
isting literature in three directions. 

First, to this date, the relationship between the 
sample information about the parameter (Lindley's 
measure) and predictive information remains unexplored.
Lindley's measure focuses on the information
flow between the pair $(Y, \Theta)$. The predictive information
measure is based on the information flow between
the pair $(Y, Y_\nu)$. The key to exploring the relationship
between the information provided by the
sample about the parameter and for the prediction
is through viewing $(\Theta, Y_\nu)$ jointly as an interrelated
pair. In this perspective, $\Theta$ plays an intermediary
role in the information flow from the data $y$ to the
prediction quantity $Y_\nu$. The information flow from $Y$
to the pair $(\Theta, Y_\nu)$ is different when $Y_i|\theta$, $i = 1, 2, \ldots$,
are conditionally independent and conditionally dependent.
Panel (a) of Figure 1 depicts the conditionally
independent model and its information flow diagram.
In this case, the parameter $\Theta$ is the only link
between $Y$ and $Y_\nu$, thus the information flows from
the data to the predictive distribution solely through
the parameter. This information flow from $Y$ to $\Theta$
to $Y_\nu$ is analogous to the data processing of
information theory (Cover and Thomas, 1991) where
$(Y, \Theta, Y_\nu)$ is a Markovian triplet. We will show that
in this case the sample information about the parameter
is in fact the entire information provided
by $Y$ about $(\Theta, Y_\nu)$ jointly, and that the predictive
information is only a part of it. We will further show
that for some important classes of models, such as 
the normal linear model and a large family of life- 
time models, the predictive information provided by 
the conditionally independent sample is only a small 
fraction of the parameter (joint) information. 

Second, thus far, the effects of dependence in the
sequence $Y_i|\theta$, $i = 1, 2, \ldots$, on the Bayesian information
measures remain unexplored. Panel (b) of Figure 1
shows the graphical representation of the conditionally
dependent model and its information flow
diagram. In this case, the information flows from the
data to the predictive distribution directly due to the
conditional dependence, as well as indirectly via the
parameter. Consequently, the relationship between
the parameter and predictive information measures
is quite different from that for the conditionally independent
case. We will show that for the conditionally
dependent case, the sample information for the
pair $(\Theta, Y_\nu)$ decomposes into the information about
the parameter (Lindley's measure) and an information
measure mapping the conditional dependence.
We study the role of dependence for three important
cases: the intraclass (IC) and serial correlation
(SC) dependence structures for the normal sample,
and order statistics where no particular distribution
is specified for the likelihood and prior. Estimation
of the normal mean and prediction under
the IC and SC models are commonplace. We examine
the effects of dependence on the parameter
and predictive information measures drawing from
Pourahmadi and Soofi's (2000) study of information
measures for prediction of future outcomes in time
series. We will show that the sample can provide
a substantial amount of information for prediction
and the dominance of parameter information that
was noted for the conditionally independent case no
longer holds. Order statistics, which conditional on
the parameter form a Markovian sequence (Arnold,
Balakrishnan and Nagaraja, 1992), also provide a
useful context for studying the effects of dependence
on information measures. For example, in life testing,
the information that the first $r$ failure times
provide about the model parameter as well as about
the time to next failure $Y_{r+1}$ is of interest. Here,
$n$ items are under the test, failures are observed one
at a time, and it is desirable to determine at an
early stage how costly the testing is going to be
and whether an action such as a redesign is warranted.
Such joint parameter-predictive inferences
were considered by Lawless (1971), Kaminsky and
Rhodin (1985) and Ebrahimi (1992) under various
sampling plans.

Fig. 1. Graphics of conditional independent and dependent
models. (a) Conditional independent. (b) Conditional dependent.

Third, the Bayesian information research has fo- 
cused either on the design or on the prior. The past 
research has mainly used two types of models en- 
compassing two different parameters: the linear model 
for the normal mean parameter, and the lifetime 
model where the scale parameter of an exponential 
family distribution is of interest. We consider the 
normal linear model with normal prior distribution 
for the mean and a subfamily of the exponential 
family under the gamma prior distribution for the 
scale parameter. This subfamily includes the expo- 
nential distribution and many parametric families 
such as Weibull, Pareto and Gumbel extreme value. 
For each class of models, we examine the relation- 
ships between the parameter and predictive infor- 
mation measures. Furthermore, we explore the ef- 
fects of sampling plan and prior distribution on the 
parameter and predictive information measures. We 
will show that under the optimal design for the pa- 
rameter estimation, the loss of information for pre- 
diction is not nearly as severe as the loss of informa- 
tion about the parameter under the optimal design 
for prediction. 

This article is organized as follows. Section 2 pre- 
sents the measures of information provided by the 
sample about the parameter and prediction, includ- 
ing results on the relationship between them for the 
conditionally independent model. Section 3 explores 
the measures of information provided by the sample 
about the parameter and prediction in terms of the 
prior and design matrix for linear models. Section 
4 explores the measures of information provided by 
the sample about the parameter and prediction for 
a subfamily of the exponential family and explores 
the interplay between parameter and predictive in- 
formation for a broad family of distributions gener- 
ated by transformations of the exponential model. 
Section 5 examines information measures for con- 
ditionally dependent samples. Section 6 gives the 
concluding remarks. The Appendix provides a clas- 
sification of the literature on Bayesian applications 
of the mutual information and some technical de- 
tails. 



2. INFORMATION MEASURES 

Let $Q$ represent the unknown quantity of interest:
$\Theta$, $Y_\nu$, individually or as a pair, or a function of
them. For notational convenience we represent a probability
distribution by its density function $f(\cdot)$ and
use subscript $i$ for the elements of the data vector $y$ and
$Y_\nu$, $\nu \neq i$, for prediction. Information provided by
the data $y$ about $Q$ is measured by a function that
maps changes between a prior distribution $f(q)$ and
the posterior distribution $f(q|y)$ obtained via the
Bayes rule. Two measures of changes of the prior
and posterior distributions are as follows. The uncertainty
about $Q$ is measured by Shannon entropy

      $H(Q) = H(f) = -\int f(q) \log f(q)\, dq,$

and the observed sample information about $Q$ is
measured by the entropy difference

(1)   $\Delta H(y; Q) = H(Q) - H(Q|y).$

The information discrepancy between the prior and
posterior distributions is measured by the Kullback-
Leibler divergence

(2)   $K[f(q|y) : f(q)] = \int f(q|y) \log \frac{f(q|y)}{f(q)}\, dq \ge 0,$

where the equality in (2) holds if and only if $f(q|y) = f(q)$
almost everywhere. The observed sample information
measure (1) can be positive or negative depending
on which of the two distributions is more
concentrated (less uniform). For a $k$-dimensional random
vector $Q$, an orthonormal $k \times k$ matrix $A$ and a
$k \times 1$ vector $c$, $H(AQ + c) = H(Q)$, but (1) is invariant
under all linear transformations of $Q$. The information
discrepancy (2) is a relative entropy which
only detects changes between the prior and the posterior,
without indicating which of the two distributions
is more informative. It is invariant under all
one-to-one transformations of $Q$.

The expected sample information measures are 
obtained by viewing the observed information mea- 
sures (1) and (2) as functions of the data and averag- 
ing them with respect to the marginal distribution 
of Y. The expected entropy difference and expected 
Kullback-Leibler divergence provide the same mea- 
sure, known as the mutual information 



(3)   $M(Y; Q) = E_y\{\Delta H(y; Q)\} = E_y\{K[f(q|y) : f(q)]\},$

where $E_y$ denotes averaging with respect to

      $f(y) = \int f(\theta) f(y|\theta)\, d\theta.$

Other representations of $M(Y; Q)$ are

(4)   $M(Y; Q) = H(Q) - H(Q|Y) = K[f(q, y) : f(q) f(y)],$

where

      $H(Q|Y) = E_y\{H(Q|y)\} = \int H(Q|y) f(y)\, dy$

is referred to as the conditional entropy in the infor- 
mation theory literature. The first representations 
in (3) and (4) are in terms of the expected uncer- 
tainty reduction, and the second representation in 
(4) shows that the mutual information is symmetric 
in $Q$ and $Y$. It is noteworthy that the
equalities in (3) and (4) do not hold, in general, for
generalizations of Shannon entropy and Kullback-
Leibler information divergence, such as Rényi measures;
see the article by Ebrahimi, Soofi and Soyer
(2010).
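
The equality of the representations in (3) and (4) can be checked numerically on a small discrete example. The following is a minimal sketch, not part of the original article: the joint table `joint` is a hypothetical distribution chosen only for illustration, and sums replace the integrals.

```python
import math

# Hypothetical joint distribution f(q, y) on a 2 x 3 grid (rows: q, columns: y).
joint = [[0.30, 0.15, 0.05],
         [0.05, 0.15, 0.30]]

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, r):
    """Kullback-Leibler divergence K[p : r] for discrete distributions."""
    return sum(x * math.log(x / z) for x, z in zip(p, r) if x > 0)

f_q = [sum(row) for row in joint]                  # prior f(q)
f_y = [sum(col) for col in zip(*joint)]            # marginal f(y)
# Posterior f(q|y), one list per value of y.
post = [[joint[i][j] / f_y[j] for i in range(len(f_q))]
        for j in range(len(f_y))]

# (3), first form: expected entropy difference E_y{H(Q) - H(Q|y)}.
m_entropy = sum(f_y[j] * (entropy(f_q) - entropy(post[j]))
                for j in range(len(f_y)))
# (3), second form: expected divergence E_y{K[f(q|y) : f(q)]}.
m_kl = sum(f_y[j] * kl(post[j], f_q) for j in range(len(f_y)))
# (4), second form: K[f(q, y) : f(q) f(y)].
flat_joint = [joint[i][j] for i in range(len(f_q)) for j in range(len(f_y))]
flat_prod = [f_q[i] * f_y[j] for i in range(len(f_q)) for j in range(len(f_y))]
m_joint = kl(flat_joint, flat_prod)

print(m_entropy, m_kl, m_joint)
```

The three printed values agree up to floating-point rounding, although the observed entropy difference for an individual $y$ need not equal the observed divergence for that $y$.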

Some useful properties of the mutual information 
are as follows: 

1. $M(Y; Q) \ge 0$, where the equality holds if and
only if $Q$ and $Y$ are independent.

2. The conditional mutual information is defined
by $M(Y; Q|S) = E_s[M(Y; Q|s)] \ge 0$, where the
equality holds if and only if $Q$ and $Y$ are conditionally
independent.

3. Given $f(q)$, $M(Y; Q)$ is convex in $f(q|y)$ and
given $f(q|y)$, $M(Y; Q)$ is concave in $f(q)$.

4. Let $Y_n$ denote a vector of dimension $n$, $Y_j \in Y_n$
and $Y_j \notin Y_{n-1}$. Then

(5)   $M(Y_n; Q) = M(Y_{n-1}; Q) + M(Q; Y_j|Y_{n-1}) \ge M(Y_{n-1}; Q),$

thus $M(Y_n; Q)$ is increasing in $n$.

5. $M(Y; Q)$ is invariant under one-to-one transformations
of $Q$ and $Y$.

2.1 Marginal Information 

For $Q = \Theta$, the observation $y$ provides the likelihood
function, $\mathcal{L}(\theta) \propto f(y|\theta)$, and updates the prior
to the posterior distribution

(6)   $f(\theta|y) \propto f(\theta) f(y|\theta).$

The expected sample information about the parameter,
$M(Y; \Theta)$, is known as Lindley's measure (Lindley,
1956) and is referred to as the parameter information.

The following properties are also well known:

1. Let $S_n = S(Y)$ be a general transformation. Then
$M(Y; \Theta) \ge M(S_n; \Theta)$, where the equality holds if
and only if $S_n$ is a sufficient statistic for $\theta$.

2. $M(Y_n; \Theta)$ is concave in $n$, which implies that
$M(Y_j; \Theta|Y_{n-1}) \le M(Y_j; \Theta)$.

3. Ignorance between two neighboring values in the
parameter space, $P(\theta) = P(\theta + \delta(\theta)) = 0.5$, implies
that $M(Y; \Theta) \approx 2\delta^2(\theta)\mathcal{I}_F(\theta)$ as $\delta\theta \to 0$, where
$\mathcal{I}_F(\theta)$ is Fisher information (Lindley, 1961, page
467). A similar approximation holds more generally
for $M(Y; Q)$; see the classic book of Kullback
(1959).

For $Q = Y_\nu$, the prior and posterior predictive distributions,
respectively, are given by

      $f(y_\nu) = \int f(y_\nu|\theta) f(\theta)\, d\theta$

and

(7)   $f(y_\nu|y) = \int f(y_\nu|\theta) f(\theta|y)\, d\theta.$

The expected information $M(Y; Y_\nu)$ is referred to as
the predictive information (San Martini and Spezzaferri,
1984; Amaral and Dunsmore, 1985).

In some problems, both the parameter and the
prediction are of interest (Chaloner and Verdinelli,
1995). Verdinelli (1992) proposed the linear combination
of marginal utilities

(8)   $U(Y; \Theta, Y_\nu) = w_1 M(Y; \Theta) + w_2 M(Y; Y_\nu),$

where $w_k \ge 0$, $k = 1, 2$, are weights that reflect the
relative importance of the parameter and prediction
for the experimenter. Since $\Theta$ and $Y_\nu$ are not independent
quantities, $M(Y; \Theta)$ and $M(Y; Y_\nu)$ are not
additively separable. The weights in (8) do not take
into account the dependence between the prediction
and the parameter.

2.2 Joint Information 

Taking the dependence between the parameter and 
prediction into account requires considering the joint 
information for the vector of parameter and prediction.
The observed and expected information measures
are defined by (1) and (3) where $Q = (\Theta, Y_\nu)$,
and will be denoted as $\Delta H[y; (\Theta, Y_\nu)]$ and
$M[Y; (\Theta, Y_\nu)]$. The next theorem encapsulates the relationships
between the joint, parameter and predictive
information measures for the conditionally independent
samples.

Theorem 1. If $Y_1|\theta, Y_2|\theta, \ldots$ are conditionally
independent, then:

(a) $\Delta H(y; \Theta) = \Delta H[y; (\Theta, Y_\nu)]$;

(b) $M(Y; \Theta) = M[Y; (\Theta, Y_\nu)]$;

(c) $M(Y; Y_\nu) \le M(Y; \Theta)$.

Proof. The proof of (a) is as follows. The joint
entropy decomposes additively as

      $H(\Theta, Y_\nu) = H(\Theta) + H(Y_\nu|\Theta),$

where $H(Y_\nu|\Theta) = E_\theta\{H(Y_\nu|\theta)\}$ is the conditional
entropy. Letting $Q = (\Theta, Y_\nu)$ in (1) and applying the
entropy decomposition to each entropy, we have

      $\Delta H[y; (\Theta, Y_\nu)] = H(\Theta) + H(Y_\nu|\Theta) - \{H(\Theta|y) + H(Y_\nu|\Theta, y)\},$

where $H(Y_\nu|\Theta, y) = E_\theta\{H(Y_\nu|\theta, y)\}$. The first and
third terms give $\Delta H(y; \Theta)$. The conditional independence
implies for each $\theta$, $H[f(y_\nu|\theta, y)] = H[f(y_\nu|\theta)]$,
thus $E_\theta\{H(Y_\nu|\theta, y)\} = E_\theta\{H(Y_\nu|\theta)\}$, and the
second and fourth terms cancel out, which gives (a).
Since $Y \to \Theta \to Y_\nu$ is a Markovian triplet, parts (b)
and (c) are implied by properties of the mutual information
functions of Markovian sequences (see, e.g.,
Cover and Thomas, 1991, pages 27, 32-33). □

By part (a) of Theorem 1, under the condition- 
ally independent model, the information provided 
by each and every sample about the parameter is 
the same as the joint information for the parameter 
and prediction. 

Part (b) of Theorem 1 provides a broader interpretation
of Lindley's information, namely expected
information provided by the data about the parameter
and for the prediction. An immediate implication
is that the prior distribution (Bernardo, 1979a,
1979b), the design (Chaloner and Verdinelli, 1995;
Polson, 1993) and the likelihood model (Yuan and
Clarke, 1999) that maximize $M(Y; \Theta)$ also maximize
sample information about the parameter and
prediction jointly. However, by part (c) of Theorem
1, such optimal prior, design, and model may not be
optimal according to $M(Y; Y_\nu)$. Similarly, the optimal
design of Verdinelli, Polson and Singpurwalla
(1993) and the optimal model of San Martini and
Spezzaferri (1984) which maximize $M(Y; Y_\nu)$ may
not be optimal according to $M(Y; \Theta)$.



The inequality in (c) is the Bayesian version of
the information processing inequality of information
theory, and can be referred to as the Bayesian data
processing inequality mapping the information flow
$Y \to \Theta \to Y_\nu$ through (6) and (7), as shown in Figure
1(a).

By part (b) of Theorem 1 and decomposition of
$M[Y; (\Theta, Y_\nu)]$ we have

(9)   $M(Y; \Theta) = M(Y; Y_\nu) + M(Y; \Theta|Y_\nu),$

where $M(Y; \Theta|Y_\nu) = E_{y_\nu}\{K[f(y, \theta|y_\nu) : f(\theta|y_\nu) f(y|y_\nu)]\}$
is the conditional mutual information
between $\Theta$ and $Y$, given $Y_\nu$. This measure is the
link between the parameter and predictive information
measures and is key for studying their relationship.
Applying (9) to the utility function (8) gives
the weights for the additive information measures in
(9) as

(10)  $U(Y; \Theta, Y_\nu) = w_1 M(Y; \Theta|Y_\nu) + (w_1 + w_2) M(Y; Y_\nu).$
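
Parts (b) and (c) of Theorem 1 and the decomposition (9) can be verified on a toy conditionally independent model. The sketch below is illustrative only: the binary parameter, the prior `prior` and the success probabilities `p_one` are hypothetical choices, not taken from the article.

```python
import math
from collections import defaultdict
from itertools import product

prior = {0: 0.3, 1: 0.7}       # hypothetical prior f(theta)
p_one = {0: 0.2, 1: 0.8}       # hypothetical P(Y = 1 | theta)

def bern(y, theta):
    """Bernoulli sampling model f(y | theta)."""
    return p_one[theta] if y == 1 else 1.0 - p_one[theta]

# Joint f(theta, y, y_nu) with Y and Y_nu conditionally independent given theta.
joint = {(t, y, v): prior[t] * bern(y, t) * bern(v, t)
         for t, y, v in product((0, 1), repeat=3)}

def mutual_information(pairs):
    """M(A; B) from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pairs.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

def table(a_of, b_of):
    """Marginalize the joint onto the pair (a_of(key), b_of(key))."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[(a_of(key), b_of(key))] += p
    return out

m_param = mutual_information(table(lambda k: k[1], lambda k: k[0]))          # M(Y; Theta)
m_joint = mutual_information(table(lambda k: k[1], lambda k: (k[0], k[2])))  # M[Y; (Theta, Y_nu)]
m_pred = mutual_information(table(lambda k: k[1], lambda k: k[2]))           # M(Y; Y_nu)

# Conditional mutual information M(Y; Theta | Y_nu) = E_{y_nu}[M(Y; Theta | y_nu)].
m_cond = 0.0
for v in (0, 1):
    p_v = sum(p for k, p in joint.items() if k[2] == v)
    cond = {(k[1], k[0]): p / p_v for k, p in joint.items() if k[2] == v}
    m_cond += p_v * mutual_information(cond)

print(m_param, m_joint, m_pred, m_cond)
```

In this example the parameter and joint information coincide, the predictive information is smaller, and the difference is exactly the conditional term in (9).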

3. LINEAR MODELS 

Consider the normal linear model

      $y = X\beta + \varepsilon,$

where $y$ is an $n \times 1$ vector of observations, $X$ is
an $n \times p$ design matrix, $\beta$ is the $p \times 1$ parameter
vector, and $\varepsilon$ is the error vector. Under the conditionally
independent model $f(\varepsilon|\beta) = N(0, \sigma_1^2 I_n)$, $\sigma_1^2 > 0$ is
known and $I_n$ is the identity matrix of dimension $n$.

It will be more insightful to use the orthonormal
rotation $Z = XG$ and $\theta = G'\beta$, where $G$ is
the matrix of eigenvectors of $X'X$, and
$\Lambda = Z'Z = \mathrm{diag}[\lambda_1, \ldots, \lambda_p]$ where $\lambda_j > 0$, $j = 1, \ldots, p$, are the
eigenvalues of $X'X$. By the invariance of entropy
under orthonormal transformations, $\Delta H(y; \Theta) = \Delta H(y; \beta)$
and by invariance of mutual information
under all one-to-one transformations, $M(Y; \Theta) = M(Y; \beta)$.

We use the normal conjugate prior $f(\theta) = N(m_0, \sigma_0^2 V_0)$,
where $V_0 = \mathrm{diag}[v_{01}, \ldots, v_{0p}]$. The posterior
distribution is $f(\theta|y) = N(m_1, \sigma_1^2 V_1)$ where
$m_1 = V_1(\eta V_0^{-1} m_0 + Z'y)$, $V_1 = (\eta V_0^{-1} + Z'Z)^{-1}$ and
$\eta = \sigma_1^2/\sigma_0^2$. All distributions and information measures
are conditional on $Z$ and $\sigma_1^2$ which are assumed
to be given. The prior and posterior entropies are
$H(\Theta|\sigma_k^2 V_k) = \frac{p}{2} \log(2\pi e) + \frac{1}{2} \log|\sigma_k^2 V_k|$, $k = 0, 1$, where
$|\cdot|$ denotes the determinant. Since entropy is location
invariant, $m_k$ does not matter. Also since $V_1$
does not depend on data $y$, the conditional entropy
and posterior entropies are equal,
$H(\Theta|Y, Z, \eta, V_0) = H(\Theta|y, Z, \eta, V_0)$. Thus, the observed and expected
sample information measures are the same, given
by

(11)  $M(Y; \Theta|Z, \eta, V_0) = \Delta H(y; \Theta|Z, \eta, V_0)$
      $= \frac{1}{2} \log|I_p + \eta^{-1} V_0 Z'Z|$
      $= \frac{1}{2} \sum_{j=1}^{p} \log(1 + \eta^{-1} v_{0j} \lambda_j).$



From (11) it is clear that the parameter (joint) information
is decreasing in $\eta$ and increasing in $v_{0j}$, $\lambda_j$
and $\sigma_0^2$. Thus, given the prior, the information can
be optimized through the choices of design parameters
$\lambda_j$, $j = 1, \ldots, p$, and for given data (design), the
information can be optimized through the prior parameters
$\sigma_0^2$ and $v_{0j}$, $j = 1, \ldots, p$.

The prior and posterior predictive distributions of
a future outcome $Y_\nu$ to be taken at a point $z_\nu$ are
normal $N(z_\nu' m_k, \sigma_k^2 z_\nu' V_k z_\nu + \sigma_1^2)$, $k = 0, 1$, and

(12)  $M(Y; Y_\nu|z_\nu, Z, \eta, V_0) = \Delta H(y; Y_\nu|z_\nu, Z, \eta, V_0)$
      $= \frac{1}{2} \log \frac{\eta^{-1} z_\nu' V_0 z_\nu + 1}{z_\nu' V_1 z_\nu + 1}.$

Parts (a) and (b) of Theorem 1 give
$\Delta H[y; (\Theta, Y_\nu)|z_\nu, Z, \eta, V_0] = \Delta H(y; \Theta|Z, \eta, V_0)$ and
$M[Y; (\Theta, Y_\nu)|z_\nu, Z, \eta, V_0] = M(Y; \Theta|Z, \eta, V_0)$. Therefore all existing
results for $M(Y; \Theta|Z, \eta, V_0)$ apply to the joint
parameter and predictive information, as well. Part
(c) of Theorem 1 provides an additional insight:
$M(Y; Y_\nu|z_\nu, Z, \eta, V_0) \le M(Y; \Theta|Z, \eta, V_0)$. These relationships
hold for multiple predictions, as well.
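
As an illustration of (11) and (12), the following sketch evaluates both measures in the rotated coordinates, where $Z'Z$ and $V_0$ are diagonal. The numerical inputs (`eta`, `v0`, `lam`, `z`) are hypothetical values chosen for the example, and the function names are not from the article.

```python
import math

def parameter_information(eta, v0, lam):
    """Equation (11): 0.5 * sum_j log(1 + v0_j * lam_j / eta)."""
    return 0.5 * sum(math.log(1.0 + v * l / eta) for v, l in zip(v0, lam))

def predictive_information(eta, v0, lam, z):
    """Equation (12) in the rotated coordinates, where Z'Z = diag(lam)."""
    # V_1 = (eta * V_0^{-1} + Z'Z)^{-1} is diagonal when V_0 and Z'Z are diagonal.
    v1 = [1.0 / (eta / v + l) for v, l in zip(v0, lam)]
    prior_term = sum(zj * zj * v for zj, v in zip(z, v0)) / eta + 1.0
    post_term = sum(zj * zj * v for zj, v in zip(z, v1)) + 1.0
    return 0.5 * math.log(prior_term / post_term)

# Hypothetical inputs: p = 2, unit prior variances, eigenvalues 7 and 3.
eta, v0, lam = 1.0, [1.0, 1.0], [7.0, 3.0]
z = [math.cos(0.4), math.sin(0.4)]          # unit-length prediction point

m_param = parameter_information(eta, v0, lam)
m_pred = predictive_information(eta, v0, lam, z)
print(m_param, m_pred)
```

For these inputs the parameter information equals $\frac{1}{2}\log 32$, and the predictive information is positive but considerably smaller, in line with part (c) of Theorem 1.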

3.1 Optimal Designs 

Several authors have studied parameter information
in the context of experimental design. It is clear
from (11) that given $V_0 = I_p$ and the trace $\mathrm{Tr}(Z'Z) = \sum_{j=1}^{p} \lambda_j$,
the optimal parameter information design
is obtained when all eigenvalues are equal, $\lambda_j = \bar{\lambda} = \frac{1}{p} \sum_{k=1}^{p} \lambda_k$,
which gives the Bayesian D-optimal design
(see Chaloner and Verdinelli, 1995, for references).
That is, with the uncorrelated prior the information
optimal design is orthogonal. For the case
of weak prior information, $\sigma_0^2 \to \infty$, maximizing the
expected parameter information gain is equivalent
to the classical criterion of D-optimality. If the experimental
information is weak, then the Bayesian
criterion reduces to the classical criterion of A-optimality
when $V_0 = I_p$ (Polson, 1993). Verdinelli, Polson
and Singpurwalla (1993) used the predictive information
optimal design for accelerated life testing.

To illustrate implications of Theorem 1 for design
we consider the simple case when $x_{ij} \in \{0, 1\}$.
This is a one-way ANOVA structure, when the averages
(parameters) as well as contrasts between
the individual outcomes are of interest. In this case,
$\mathrm{Tr}(\Lambda) = \sum_{j=1}^{p} n_j = n$ and the design parameters are
$\lambda_j = n_j$. The following proposition gives the optimal
designs according to the parameter (joint) information
$M(Y; \Theta)$ and predictive information $M(Y; Y_\nu)$.

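Proposition 1. Given $\eta$, $V_0$ and $\sum_{j=1}^{p} n_j = n$:

(a) The optimal sample allocation scheme according
to the parameter (joint) information $M(Y; \Theta)$
is

(13)  $n_1^* = \frac{n}{p} + \frac{\eta}{p} \sum_{j=2}^{p} (v_{0j}^{-1} - v_{01}^{-1}),$
      $n_j^* = n_1^* - \eta (v_{0j}^{-1} - v_{01}^{-1}), \quad j = 2, \ldots, p,$

and the minimum sample size is determined by
$n_1^* \ge \max\{(v_{0j}^{-1} - v_{01}^{-1})\eta, j = 2, \ldots, p\}$.

(b) The information optimal sample allocation scheme
according to the predictive information $M(Y; Y_\nu)$
for prediction at $z_\nu$ is

(14)  $n_1^* = \frac{|z_{\nu 1}| n}{\sum_{j=1}^{p} |z_{\nu j}|} + \frac{\eta}{\sum_{j=1}^{p} |z_{\nu j}|} \sum_{j=2}^{p} (|z_{\nu 1}| v_{0j}^{-1} - |z_{\nu j}| v_{01}^{-1}),$
      $n_j^* = \frac{|z_{\nu j}|}{|z_{\nu 1}|} n_1^* - \frac{\eta}{|z_{\nu 1}|} (|z_{\nu 1}| v_{0j}^{-1} - |z_{\nu j}| v_{01}^{-1}), \quad j = 2, \ldots, p,$

and the minimum sample size is determined by the
requirement that $n_j^* \ge 0$, $j = 2, \ldots, p$.

Proof. See the Appendix. □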
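
The two allocation problems of Proposition 1 can be sketched numerically from (11) and (12) with $\lambda_j = n_j$. The code below is an illustration under stated assumptions, not the authors' derivation: it treats the $n_j$ as continuous, assumes the resulting allocations are nonnegative, and uses the Lagrange conditions for the two criteria; the inputs `n`, `eta`, `v0` and `z` are hypothetical.

```python
import math

def parameter_optimal_allocation(n, eta, v0):
    """Maximize sum_j log(1 + v0_j * n_j / eta) subject to sum_j n_j = n.
    The Lagrange condition makes n_j + eta / v0_j the same for every j."""
    p = len(v0)
    level = (n + eta * sum(1.0 / v for v in v0)) / p
    return [level - eta / v for v in v0]

def predictive_optimal_allocation(n, eta, v0, z):
    """Minimize sum_j z_j^2 / (eta / v0_j + n_j) subject to sum_j n_j = n.
    The Lagrange condition makes n_j + eta / v0_j proportional to |z_j|."""
    scale = (n + eta * sum(1.0 / v for v in v0)) / sum(abs(zj) for zj in z)
    return [abs(zj) * scale - eta / v for zj, v in zip(z, v0)]

def parameter_information(eta, v0, alloc):
    """Equation (11) with lambda_j = n_j."""
    return 0.5 * sum(math.log(1.0 + v * nj / eta) for v, nj in zip(v0, alloc))

def predictive_information(eta, v0, alloc, z):
    """Equation (12) with Z'Z = diag(n_1, ..., n_p)."""
    prior_term = sum(zj * zj * v for zj, v in zip(z, v0)) / eta + 1.0
    post_term = sum(zj * zj / (eta / v + nj)
                    for zj, v, nj in zip(z, v0, alloc)) + 1.0
    return 0.5 * math.log(prior_term / post_term)

# Hypothetical setting: p = 2, n = 10, eta = 1, unequal prior variances.
n, eta, v0 = 10.0, 1.0, [2.0, 0.5]
z = [0.6, 0.8]

n_par = parameter_optimal_allocation(n, eta, v0)
n_pre = predictive_optimal_allocation(n, eta, v0, z)
print(n_par, n_pre)
```

Both allocations sum to $n$, and moving observations between the two groups away from either allocation lowers the corresponding criterion.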

Note that by Theorem 1, the maximum predictive
information attained with optimal design (14)
is dominated by the parameter information:

      $M(Y; \Theta|n^*, z_\nu) = M[Y; (\Theta, Y_\nu)|n^*, z_\nu] \ge M(Y; Y_\nu|n^*, z_\nu).$

Example 1. Let $p = 2$, $n = 10$, $v_{01} = v_{02} = 1$, $\eta = 1$
and $z_\nu' z_\nu = 1$. Figure 2(a) shows the plots of the
parameter information measures under the parameter
and predictive optimal designs against $z_{\nu 1}$. In
order to make the two information measures dimensionally
comparable, we have plotted information
per parameter $\bar{M}(Y; \Theta|n^*) = M(Y; \Theta|n^*)/p$. Figure
2(b) shows the plots of the predictive information
measures under the parameter and predictive
optimal designs. Note that the vertical axes of the
two panels are different. These plots show that the
parameter (joint) information per dimension is much
higher than the predictive information even when
the design is optimal for prediction and not for the
parameter. The dashed lines show the information
quantities for the D-optimal design, which is optimal
for the parameter (joint) and for prediction at
the diagonal $z_1 = z_2 = 1/\sqrt{2} \approx 0.707$. The sample is
least informative for prediction in this direction. We
note that the loss of information for prediction is not
nearly as severe as the loss of information about the
parameter. This is due to the fact that by Theorem
1, the parameter information measures the joint information
about the parameter and prediction and
is inclusive of the predictive information. Thus, use
of the D-optimal design would be preferable if the
experimenter has interest in inference about the parameter
as well as about a prediction.

Fig. 2. Parameter information per dimension $M(Y; \Theta|Z, \eta)/p$
and predictive information $M(Y; Y_\nu|Z, z_\nu, \eta)$
under the predictive and parameter optimal designs against
$z_{\nu 1}$, $z_\nu' z_\nu = 1$, for $p = 2$, $n = 10$, $\eta = 1$, $v_{01} = v_{02} = 1$.
(a) Parameter information. (b) Predictive information.

3.2 Optimal Prior Variance 

Next we illustrate application to developing a prior
in the context of a Bayesian solution to the collinearity
problem. When the regression matrix $X$ is ill-conditioned,
posterior inference about individual parameters
is unreliable. The effects of collinearity on
the posterior distribution and compensating for the
collinearity effects by using $V_0 = I_p$ were discussed
by Soofi (1990). In the orthogonal prior variance
case, $\sum_{j=1}^{p} v_{0j} = p$ is distributed uniformly among
the components of $V_0$. The following proposition
gives an optimal prior variance allocation according
to the parameter (joint) information $M(Y; \Theta)$
that will be useful when $X'X$ is nearly singular.

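Proposition 2. Let $\lambda_1 \ge \cdots \ge \lambda_p$, $\sum_{j=1}^{p} \lambda_j = p$,
and given $\eta$ and $\sum_{j=1}^{p} v_{0j} = c$. The optimal prior
variance allocation according to the parameter (joint)
information $M(Y; \Theta)$ is

(15)  $v_{01}^* = \frac{c}{p} + \frac{\eta}{p} \sum_{j=2}^{p} (\lambda_j^{-1} - \lambda_1^{-1}),$
      $v_{0j}^* = v_{01}^* - \eta (\lambda_j^{-1} - \lambda_1^{-1}), \quad j = 2, \ldots, p,$

and the minimum prior variance is determined by

      $v_{01}^* \ge (\lambda_p^{-1} - \lambda_1^{-1}) \eta.$

Proof. See the Appendix. □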

The optimal information prior (15) allocates prior
variances to the components $\theta_j$, $j = 1, \ldots, p$, based
on the eigenvalues $\lambda_1 \ge \cdots \ge \lambda_p$ of $X'X$. So it is
in the same spirit as Zellner's $g$ prior (Zellner, 1986)
where $v_{0j} \propto \lambda_j^{-1}$, $j = 1, \ldots, p$. In the same spirit, West
(2003) and Maruyama and George (2010) have defined
generalized $g$ priors that are applicable when
$X$ is singular. Our information optimal allocation
scheme is another generalization of the $g$ prior tailored
for the collinearity problem where $X$ is full-rank,
but nearly singular.

The optimal allocation scheme (15) can be represented
in terms of the condition indices
$\kappa_j = \sqrt{\lambda_1/\lambda_j}$, $j = 1,\ldots,p$, of $X'X$ as

$$\frac{\lambda_1 v^*_{01} + \eta}{\lambda_j v^*_{0j} + \eta} = \kappa_j^2,\quad j = 1,\ldots,p,\qquad \sum_{j=1}^{p} v^*_{0j} = c.$$

The smallest portion of the total prior variance, $v^*_{0p}$,
is allocated to the component $\theta_p$ that corresponds
to the smallest eigenvalue $\lambda_p$ such that

$$\frac{\lambda_1 v^*_{01} + \eta}{\lambda_p v^*_{0p} + \eta} = \kappa^2,$$

where $\kappa = \kappa(X'X) = \sqrt{\lambda_1/\lambda_p}$ is the
condition number of $X'X$ which is used for collinearity diagnostics
(Stewart, 1987; Soofi, 1990; Belsley, 1991).

In some prediction problems, the prediction point
$z_\nu$ is given. For example, in accelerated life testing,
$z_\nu$ is the environmental condition and the experiment
must be designed such that prediction at $z_\nu$ is optimal.
The information decomposition (11) provides the clue when the
quantity of interest is the mean response $E(Y \mid z_\nu)$.
The components of $\theta = (\theta_1,\ldots,\theta_p)'$ are
independent, a priori and a posteriori, and from (11),
$M(\Theta_j; Y \mid Z,\eta,V_0) = 0.5\log(1 + \eta^{-1}v_{0j}\lambda_j)$.
Under the orthogonal prior, the sample is most informative about
the linear combination of the regression coefficients
$\theta_1 = g_1'\beta$, where $g_1$ is the first eigenvector of
$X'X$. Thus the optimal design for the expected response at a
covariate vector $z_\nu$ is $X^*$ such that $z_\nu$ is the first
eigenvector of $I_p + \eta^{-1}V_0X^{*\prime}X^*$. Under the
uncorrelated prior or weak prior, $X^*$ is the frequentist
E-optimal design, which can differ from the designs that are
optimal with respect to parameter (joint) information. The
optimal allocation scheme (15) provides improvement over the
orthogonal prior for prediction of the expected response when
$z_\nu$ is in the space of the eigenvectors corresponding to
the large eigenvalues.

Example 2. Let $p = 2$, $c = 100$ and $\eta = 1$. Figure 3
compares information measures for the optimal scheme, the
orthogonal prior and $V_0 \propto \Lambda^{-1}$, which is used
in some priors such as the g-prior. Figure 3(a) shows the plots
of parameter information $M(Y;\Theta \mid Z,\eta)$ against the
condition number $\kappa = \sqrt{\lambda_1/\lambda_2}$ of $X'X$.
Under all three priors, the parameter information
$M(Y;\Theta \mid Z,\eta)$ decreases with $\kappa$, that is, as the
regression matrix descends toward singularity. The parameter
information under the optimal scheme slightly dominates the
measure under the orthogonal prior, and both dominate the
information under the g-prior, which deteriorates quickly with
collinearity. By Theorem 1, the parameter information measure is
the joint information about the parameter and prediction and is
inclusive of the predictive information. Figure 3(b) shows
$M(Y;\Theta_1 \mid Z,\eta)$ for the direction of the first
eigenvector $\theta_1 = g_1'\beta$, that is, the most informative
direction for prediction of the expected response. The optimal
and orthogonal priors improve the information under collinearity,
but the measure for the g-prior deteriorates quickly.

Fig. 3. Parameter information $M(Y;\Theta \mid Z,\eta)$ [panel (a)]
and information for the most informative direction for prediction
of the expected response $M(Y;\Theta_1 \mid Z,\eta)$ [panel (b)],
plotted against the condition number for three types of prior
variance allocations: optimal scheme, identity matrix and g-priors
($p = 2$, $c = 100$, $\eta = 1$).
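
The comparison in Example 2 can be reproduced approximately with a short computation. This is a sketch under stated assumptions: the eigenvalues are normalized so that $\lambda_1 + \lambda_2 = 2$ as in Proposition 2, the optimal scheme is written in the equivalent form $v_{0j} = K - \eta/\lambda_j$ with a common level $K$, and all function names are hypothetical. The range of condition numbers is limited to values for which the optimal allocation stays positive.

```python
import math

def eigenvalues(kappa):
    """Eigenvalues for p = 2 with lambda_1 + lambda_2 = 2 and lambda_1/lambda_2 = kappa^2."""
    lam2 = 2.0 / (1.0 + kappa ** 2)
    return [2.0 - lam2, lam2]

def information(eigs, v, eta):
    """Parameter (joint) information: sum of 0.5 * log(1 + v_j * lambda_j / eta)."""
    return sum(0.5 * math.log(1.0 + vj * lam / eta) for vj, lam in zip(v, eigs))

def optimal_scheme(eigs, c, eta):
    """Allocation (15) written as v_j = K - eta / lambda_j with sum(v) = c."""
    p = len(eigs)
    level = c / p + (eta / p) * sum(1.0 / lam for lam in eigs)
    return [level - eta / lam for lam in eigs]

def orthogonal(eigs, c):
    """Total variance c spread uniformly over the components."""
    return [c / len(eigs)] * len(eigs)

def g_type(eigs, c):
    """Variances proportional to 1 / lambda_j, rescaled to sum to c."""
    total = sum(1.0 / lam for lam in eigs)
    return [c / (lam * total) for lam in eigs]

c, eta = 100.0, 1.0
rows = []
for kappa in range(1, 15):
    eigs = eigenvalues(float(kappa))
    rows.append((kappa,
                 information(eigs, optimal_scheme(eigs, c, eta), eta),
                 information(eigs, orthogonal(eigs, c), eta),
                 information(eigs, g_type(eigs, c), eta)))

for kappa, opt, orth, g in rows:
    print(f"kappa={kappa:2d}  optimal={opt:.4f}  orthogonal={orth:.4f}  g-type={g:.4f}")
```

At $\kappa = 1$ the three allocations coincide, so any separation between the columns is due to collinearity alone.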

4. EXPONENTIAL FAMILY 

Consider distributions in the exponential family
that provide likelihood functions in the form of

$$L(\theta) \propto \theta^{n}e^{-\theta s_n},\quad \theta > 0, \tag{16}$$

where $s_n$ is a sufficient statistic for $\theta$. This is the
likelihood function for an important class of models referred to
as the time-transformed exponential (TTE) (Barlow and Hsiung, 1983).
The TTE models are usually defined in terms of the survival function
$\bar F(y \mid \theta) = \exp\{-\theta\phi(y)\}$, $y \ge 0$, where
$\phi(y) = -\log \bar F(y \mid \theta = 1)$ and $\theta$ is the
"proportional hazard." The density functions of the TTE models
are in the form of

$$f(\phi(y) \mid \theta) = \theta\phi'(y)e^{-\theta\phi(y)}, \tag{17}$$

where $\phi(y)$ is a one-to-one transformation of $Y$ with
the exponential distribution $f(y \mid \theta) = \theta e^{-\theta y}$.
For TTE models $s_n = \sum_{i=1}^{n}\phi(y_i)$. Examples include the
exponential $\phi(y) = y$, $y \ge 0$, Weibull $\phi(y) = y^q$, $y \ge 0$,
Pareto Type I $\phi(y) = \log(y/a)$, $y \ge a > 0$, Pareto
Type II $\phi(y) = \log(1 + y)$, $y \ge 0$, Pareto Type IV
$\phi(y) = \log(1 + y^a)$, $y \ge 0$, $a > 0$, and the extreme value
$\phi(y) = e^y$.

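The family of conjugate priors for (16) is gamma
$\mathcal{G}(\alpha,\beta)$ with density function

$$f(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta},\quad \theta > 0. \tag{18}$$

The posterior distribution is $\mathcal{G}(\alpha + n, \beta + s_n)$.
The information in the observed sample is given by

$$\Delta H(y;\Theta) = H_{\mathcal{G}}(\alpha) - H_{\mathcal{G}}(\alpha + n) + \log\left(1 + \frac{s_n}{\beta}\right),$$

where $H_{\mathcal{G}}(\alpha)$ is the entropy of $\mathcal{G}(\alpha,1)$ given by

$$H_{\mathcal{G}}(\alpha) = \log\Gamma(\alpha) - (\alpha - 1)\psi(\alpha) + \alpha,$$

and $\psi(\alpha) = d\log\Gamma(\alpha)/d\alpha$ is the digamma function.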

For the TTE family (17), the marginal distribution
of $s_n$ is the inverted beta (beta prime) distribution
with density

$$f(s_n) = \frac{1}{B(\alpha,n)}\,\frac{(1/\beta)(s_n/\beta)^{n-1}}{(1 + s_n/\beta)^{\alpha+n}},\quad s_n \ge 0,$$

where $B(\alpha,n)$ is the beta function. Using
$E_{s_n}\{\log(1 + s_n/\beta)\} = \psi(\alpha + n) - \psi(\alpha)$,
the expected information for all models with likelihood functions
in the form of (16) is

$$M[Y;(\Theta,Y_\nu)] = M(Y;\Theta) = H_{\mathcal{G}}(\alpha) - H_{\mathcal{G}}(\alpha + n) + \psi(\alpha + n) - \psi(\alpha). \tag{19}$$

An interesting property of (19) is the following
recursion:

$$M(Y_n;\Theta \mid \alpha) = M(Y_{n-1};\Theta \mid \alpha) + K_{\mathcal{G}}(\alpha + n - 1), \tag{20}$$

where $Y_n$ and $Y_{n-1}$ are vectors of dimensions $n$ and
$n - 1$, and

$$K_{\mathcal{G}}(v) = K(\mathcal{G}_v : \mathcal{G}_{v+1}) = \frac{1}{v} + \psi(v) - \log v \tag{21}$$

is the Kullback-Leibler information between
$\mathcal{G}_v = \mathcal{G}(v,\beta)$ and
$\mathcal{G}_{v+1} = \mathcal{G}(v + 1,\beta)$. The recursion (20) is
found using $\psi(\alpha + 1) = \psi(\alpha) + 1/\alpha$. By (5),
$M(Y_n;\Theta \mid Y_{n-1}) = K_{\mathcal{G}}(\alpha + n - 1)$.
That is, on average, the incremental contribution of an additional
observation is equivalent to the information divergence due to
one unit increase of the prior shape parameter.
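
The recursion (20) is easy to confirm numerically. The sketch below uses only the Python standard library; because that library has no digamma function, a helper based on the recurrence $\psi(x) = \psi(x+1) - 1/x$ and the usual asymptotic series is included as an assumption of this illustration, and all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:              # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def gamma_entropy(a):
    """Entropy of the gamma distribution with shape a and unit rate."""
    return math.lgamma(a) - (a - 1.0) * digamma(a) + a

def param_info(n, a):
    """Expected parameter information (19) for sample size n and prior shape a."""
    return gamma_entropy(a) - gamma_entropy(a + n) + digamma(a + n) - digamma(a)

def kl_increment(v):
    """The quantity in (21): 1/v + psi(v) - log v."""
    return 1.0 / v + digamma(v) - math.log(v)

a = 1.5                          # hypothetical prior shape parameter
for n in range(1, 6):
    lhs = param_info(n, a)
    rhs = param_info(n - 1, a) + kl_increment(a + n - 1)
    print(n, round(lhs, 8), round(rhs, 8))
```

Since the information is zero at $n = 0$, the recursion also implies that (19) equals the sum of the increments (21) evaluated at $\alpha, \alpha+1, \ldots, \alpha+n-1$.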

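The prior predictive distribution for the exponential
model is Pareto $\mathcal{P}(\alpha,\beta)$ with density function

$$f(y_\nu) = \frac{\alpha\beta^{\alpha}}{(\beta + y_\nu)^{\alpha+1}},\quad y_\nu \ge 0.$$

The posterior predictive distribution $f(y_\nu \mid y)$ is also
Pareto with the updated parameters
$\mathcal{P}(\alpha + n, \beta + s_n)$. The predictive information
measures are given by

$$
\begin{aligned}
\Delta H(y;Y_\nu) &= H_{\mathcal{P}}(\alpha) - H_{\mathcal{P}}(\alpha + n) - \log\left(1 + \frac{s_n}{\beta}\right),\\
M(Y;Y_\nu) &= H_{\mathcal{P}}(\alpha) - H_{\mathcal{P}}(\alpha + n) - \psi(\alpha + n) + \psi(\alpha),
\end{aligned}
\tag{22}
$$

where $H_{\mathcal{P}}(\alpha) = \frac{1}{\alpha} - \log\alpha + 1$
is the entropy of $\mathcal{P}(\alpha,1)$.

By invariance of the mutual information, the expected
predictive information for the TTE family (17)
is given by (22).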

By Theorem 1, $\Delta H[s_n;(\Theta,Y_\nu)] = \Delta H(y;\Theta)$,
$M[Y;(\Theta,Y_\nu)] = M(Y;\Theta)$ and
$M(Y;Y_\nu) \le M(Y;\Theta)$. The following theorem gives a more
specific pattern of relationships.

Theorem 2. The following results hold for the
TTE family (17) and gamma prior (18):

(a) $M(Y;\Theta \mid \alpha)$ and $M(Y;Y_\nu \mid \alpha)$ are
decreasing functions of $\alpha$, increasing functions of $n$ and,
as $n \to \infty$,
$M(Y_{n+1};\Theta \mid \alpha) - M(Y_n;\Theta \mid \alpha) \to 0$ and
$M(Y;Y_\nu \mid \alpha) \to K_{\mathcal{G}}(\alpha)$.

(b) $M(Y;\Theta \mid \alpha) = M(Y;Y_\nu \mid \alpha) + M(Y;\Theta \mid \alpha + 1)$,
where $M(Y;\Theta \mid \alpha + 1)$ is the sample information with
gamma prior $\mathcal{G}(\alpha + 1,\beta)$.

(c) $M(Y;\Theta \mid \alpha) - M(Y;Y_\nu \mid \alpha)$ decreases with
$\alpha$ and increases with $n$.

Proof. For (a), it is known that the expected
parameter and predictive measures are increasing
functions of $n$. It was shown by Ebrahimi and Soofi
(1990) that for the exponential model, $M(Y;\Theta \mid \alpha)$ is
decreasing in $\alpha$. By the invariance of the mutual information
the same result holds for the TTE family.
The limits are found by noting that $K_{\mathcal{G}}(v) \to 0$ as
$v \to \infty$. That the expected predictive measure is decreasing
in $\alpha$ is found by taking the derivative, using the
series expansion of the trigamma function
$\psi'(u) = \sum_{k=0}^{\infty}(u + k)^{-2}$ (Abramowitz and Stegun,
1970), and an induction on $n$ that shows the derivative is
negative. Part (b) is found using the recursion
$\psi(\alpha + 1) = \psi(\alpha) + 1/\alpha$. Part (c) is implied by
(a) and (b). The difference is $M(Y;\Theta \mid \alpha + 1)$, which
is increasing (decreasing) in $n$ ($\alpha$). □

By part (a) of Theorem 2, the parameter and pre- 
dictive information both increase with n. Part (b) 
of Theorem 2 gives the relationship between the pa- 
rameter (joint) and the predictive information mea- 
sures. Part (c) indicates that under conditional inde- 
pendence, the parameter (joint) information grows 
faster than the predictive information with the sam- 
ple size. 
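
The decomposition in part (b) of Theorem 2 and the widening gap in part (c) can be illustrated numerically from (19) and (22). The following is a sketch using only the standard library; the digamma helper and all function names are assumptions of this illustration rather than part of the original development.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def gamma_entropy(a):
    """Entropy of the gamma distribution with shape a and unit rate."""
    return math.lgamma(a) - (a - 1.0) * digamma(a) + a

def pareto_entropy(a):
    """Entropy of the Pareto predictive distribution with shape a and unit scale."""
    return 1.0 / a - math.log(a) + 1.0

def param_info(n, a):
    """Expected parameter (joint) information, equation (19)."""
    return gamma_entropy(a) - gamma_entropy(a + n) + digamma(a + n) - digamma(a)

def pred_info(n, a):
    """Expected predictive information, equation (22)."""
    return pareto_entropy(a) - pareto_entropy(a + n) - digamma(a + n) + digamma(a)

a = 1.0
for n in (1, 5, 10, 25, 50):
    total = param_info(n, a)
    pred = pred_info(n, a)
    rest = param_info(n, a + 1.0)      # information under the prior with shape a + 1
    print(f"n={n:2d}  parameter={total:.4f}  predictive={pred:.4f}  "
          f"remainder={rest:.4f}  predictive share={pred / total:.3f}")
```

In this sketch the predictive column stays bounded while the parameter column keeps growing, which is the pattern described after Theorem 2.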

Example 3. As an application, consider Type
II censoring where the number of failures to be observed
is a design parameter. For the exponential model,
the sufficient statistic for $\theta$ in (16) is the total time
under the test

$$t_r = y_1 + \cdots + y_{r-1} + (n - r + 1)y_r,\quad r \le n,$$

where $y_1 \le y_2 \le \cdots \le y_n$ are the order statistics of a
sample of size $n$. The parameter information
$M(T_r;\Theta \mid \alpha,n)$ is given by (19) and the predictive
information $M(T_r;Y_\nu \mid \alpha,n)$ is given by (22) with
$n = r$. Ebrahimi and Soofi (1990) examined the loss of information
about the exponential failure rate. By part (a) of
Theorem 2, censoring also results in loss of predictive
information. As in the case of parameter information,
the loss of predictive information can be
compensated by the prior parameter $\alpha$. Figure 4
shows plots of the expected parameter and predictive
information measures. Figure 4(a) illustrates
the information decomposition [Theorem 2, part
(b)] for $\alpha = 1$ as a function of $n$. The parameter
information and predictive information are both
increasing in $n$. The parameter information increases
at a faster rate than the predictive information. In
this case, the difference between the parameter and
predictive information is $M(Y;\Theta \mid \alpha + 1)$, also shown
in Figure 4(a). These information measures are
decreasing in $\alpha$. Figure 4(b) shows the plots of loss of
information due to Type II censoring for $n = 25$ and
$\alpha = 1, 2$. We note that the predictive information
loss is not as severe as the parameter information
loss. As seen in the figure, the information losses
can be recovered by an increase in prior precision.

Fig. 4. Decomposition of the joint (parameter) information
$M(T_r;\Theta \mid \alpha,n)$ into predictive information
$M(T_r;Y_\nu \mid \alpha,n)$ and $M(T_r;\Theta \mid \alpha + 1,n)$
[panel (a), information decomposition, plotted against $n$] and
loss of information $M(T_n;\Theta \mid \alpha) - M(T_r;\Theta \mid \alpha)$
due to Type II censoring of exponential data [panel (b), $n = 25$,
plotted against $r$].

By part (a) of Theorem 2, $M(Y;\Theta)$ and $M(Y;Y_\nu)$
are maximized by choosing $\alpha$ as small as possible.
It is natural to expect that the limiting case, which
is the Jeffreys prior $f(\theta) \propto \theta^{-1}$, would be
optimal with respect to both the parameter and prediction
information. But its use is consequential. Since the
Jeffreys prior is improper, the expected parameter
information is given by the negative conditional entropy
of the posterior distribution, which is proper.
However, unlike the mutual information, the entropy
is not invariant under one-to-one transformations
and the result depends on the parametric function
of interest. For example, for the exponential model,
the posterior distribution of the failure rate $\theta$ is gamma
$f(\theta \mid s_n) = \mathcal{G}(n,s_n)$ and its entropy is
$H[f(\theta \mid s_n)] = H_{\mathcal{G}}(n) - \log s_n$. The
distribution of $S_n$ is Pareto $f(s_n) \propto s_n^{-n}$, which is
proper for $n > 1$ and $s_n \ge s_0 > 0$. The expected parameter
information, $I(\Theta \mid S_n) = -\mathcal{H}[f(\Theta \mid s_n)]$,
is a decreasing function of $n$. But the posterior distribution of
the mean parameter $\mu = \theta^{-1}$ is inverse-gamma and
information about the mean is increasing in $n$. With the Jeffreys
prior, the prior predictive distribution is also improper. The
posterior predictive is Pareto $\mathcal{P}(n,s_n)$ and its
entropy is $H[f(Y_\nu \mid s_n)] = H_{\mathcal{P}}(n) + \log s_n$.
The expected predictive information is
$I(Y_\nu \mid S_n) = -\mathcal{H}[f(Y_\nu \mid s_n)]$,
which is an increasing function of $n$.

5. DEPENDENT SEQUENCES 

When the sequence of random variables $Y_i \mid \theta$,
$i = 1,2,\ldots$, is not conditionally independent, the
information provided by the sample about the parameter
and prediction jointly decomposes as

$$
\begin{aligned}
M[Y;(\Theta,Y_\nu)] &= M(Y;\Theta) + M(Y;Y_\nu \mid \Theta)\\
&= M(Y;Y_\nu) + M(Y;\Theta \mid Y_\nu),
\end{aligned}
\tag{23}
$$

where $M(Y;Y_\nu \mid \Theta) \ge 0$ is the measure of conditional
dependence; the inequality becomes an equality in the case of
conditional independence, and (23) then gives (9). Thus, for the
conditionally dependent sequence, $M[Y;(\Theta,Y_\nu)]$ exceeds
$M(Y;\Theta)$ by the amount $M(Y;Y_\nu \mid \Theta) > 0$. Also
from (23), we find that

$$M(Y;\Theta) \le M(Y;Y_\nu)\quad\text{if and only if}\quad M(Y;\Theta \mid Y_\nu) \le M(Y;Y_\nu \mid \Theta).$$

For a strongly conditionally dependent sequence, the
second inequality is plausible and the predictive
information $M(Y;Y_\nu)$ can dominate the parameter
information $M(Y;\Theta)$.

In this section we first examine the effects of cor- 
relation between observations on the information 
about the mean parameter and prediction where the 
data are normally distributed. We then consider or- 
der statistics where no particular prior distribution 
and model for the likelihood function are assumed. 

5.1 Intraclass and Serially Correlated Models 

We consider the intercept linear model
$f(y \mid \theta) = N(\theta z,\sigma_1^2R)$, where $z$ is an
$n \times 1$ vector of ones and
$R = R_{|\theta} = [\rho_{ij|\theta}]$ is a known correlation matrix.
By invariance of the mutual information, the results
hold for all distributions of variables that are one-to-one
transformations of elements of $y$, for example, the
log-normal model. As before, $\sigma_1^2 > 0$ is known and
$f(\theta) = N(\mu_0,\sigma_0^2)$. The posterior variance is given by
$\sigma^2_{\theta|y} = \sigma_0^2[1 + T_n(R)\eta^{-1}]^{-1}$, where
$T_n(R) = z'R^{-1}z$ is the sum of all elements of $R^{-1}$. The
parameter information is given by

$$M(Y;\Theta \mid R) = 0.5\log\bigl(1 + \eta^{-1}T_n(R)\bigr). \tag{24}$$



The following representations facilitate computation
and study of the predictive and joint information
measures. If $Y_\nu$ and $Y_\nu \mid y$ are normal, then the
predictive information is given by

$$M(Y;Y_\nu) = -0.5\log\bigl(1 - \rho^2_{y_\nu,y}\bigr) = 0.5\log[C^{-1}]_{\nu\nu}, \tag{25}$$

where $\rho^2_{y_\nu,y}$ is the square of the unconditional multiple
correlation coefficient of the regression of $Y_\nu$ on $y$,
$C = [c_{ij}]$, $i,j = 1,\ldots,n+1$, denotes the correlation
matrix of the $(n+1)$-dimensional vector $(Y,Y_\nu)$,
and $[C^{-1}]_{\nu\nu}$ denotes the $(\nu,\nu)$ element of $C^{-1}$.

The joint information about the parameter and
prediction can be computed by the first decomposition
in (23),

$$M[Y;(\Theta,Y_\nu)] = M(Y;\Theta \mid R) + M(Y;Y_\nu \mid \Theta), \tag{26}$$

where $M(Y;\Theta \mid R)$ is given in (24) and the measure
of conditional dependence can be computed similarly
to (25):

$$M(Y;Y_\nu \mid \Theta) = -0.5\log\bigl(1 - \rho^2_{y_\nu,y|\theta}\bigr) = 0.5\log[C_{|\theta}^{-1}]_{\nu\nu} \ge 0, \tag{27}$$

where $\rho^2_{y_\nu,y|\theta}$ is the square of the conditional
multiple correlation coefficient and
$C_{|\theta} = [c_{ij|\theta}]$, $i,j = 1,\ldots,n+1$, is the
correlation matrix of the conditional distribution of $(Y,Y_\nu)$,
given $\theta$. Note that $C_{|\theta}$ includes $R$
and an additional row and column for $Y_\nu$.

Measures such as the determinant $|R|$ and condition
number $\kappa(R) = \sqrt{\lambda_1/\lambda_n}$, where
$\lambda_1 \ge \cdots \ge \lambda_n$ are the eigenvalues of $R$,
can be used to rank dependence of the normal samples. However,
in general, these measures do not provide a unique ranking. In
order to rank the dependence uniquely as well as for
ranking the predictive information in terms of sample
dependence, we assume some structures for $R$.
We consider two important models: the intraclass
(IC) model with $\rho_{ij|\theta} = \rho$ for all $i \ne j$, and
the serial correlation (SC) model with
$\rho_{i,i\pm k|\theta} = \rho^k \ge 0$, $k > 0$.
Dependence within each of these models and between
the two models is ranked uniquely by $|R|$ and
$\kappa(R)$.

Table 1 shows $|R|$ and $T_n(R)$ for the IC and SC models
along with the independent (uncorrelated) model
(UC). The determinants and inverses of the IC and
SC matrices are well known. Using $T_n(R)$ in (24)
gives the parameter information. The third row of
Table 1 shows $\rho^2_{y_\nu,y|\theta}$, which is computed using (27)
with $(n+1)$-dimensional IC and SC structures for $C$.
Table 1 also shows the square of the unconditional (predictive)
correlation $\rho_p^2 = c_{ij}^2$, which is used in (25)
for computing the predictive information measures.
Computation of $\rho_p^2$ is shown in the Appendix.
The last row of Table 1 shows the square of the unconditional
multiple correlation coefficient $\rho^2_{y_\nu,y}$ computed
from (25). The predictive measure for the SC
model is for the one-step prediction.

Table 1
Formulas for uncorrelated, intraclass and serial correlation models

$$
\begin{array}{llll}
 & \text{Uncorrelated (UC)} & \text{Intraclass (IC)} & \text{Serial correlation (SC)}\\[4pt]
\text{Conditional sequence} & & & \\[2pt]
|R_{|\theta}| & 1 & [1 + (n-1)\rho](1-\rho)^{n-1} & (1-\rho^2)^{n-1}\\[6pt]
T_n(R_{|\theta}) & n & \dfrac{n}{1 + (n-1)\rho} & \dfrac{n - (n-2)\rho}{1 + \rho}\\[8pt]
\rho^2_{y_\nu,y|\theta} & 0 & \dfrac{n\rho^2}{1 + (n-1)\rho} & \rho^2\\[8pt]
\text{Predictive sequence} & & & \\[2pt]
\rho_p^2 & \left(\dfrac{1}{1+\eta}\right)^{2} & \left(\dfrac{1+\eta\rho}{1+\eta}\right)^{2} & \left(\dfrac{1+\eta\rho}{1+\eta}\right)^{2}\ \text{(immediate future)}\\[8pt]
\rho^2_{y_\nu,y} & \dfrac{n}{(1+\eta)(n+\eta)} & \dfrac{n\rho_p^2}{1 + (n-1)\rho_p} &
\end{array}
$$

The effects of the prior on the information quantities
are induced through $\eta$, which is proportional to prior
precision. Clearly (24) is decreasing in $\eta$. Using the
last two rows of Table 1 it can be shown that (25)
and the difference between (24) and (25) are also
decreasing in $\eta$. Thus, the optimal prior for inference
about the parameter and prediction is to choose the
prior variance as large as possible.

The following theorem summarizes the effects of 
the IC and SC correlation structures on the normal 
information measures (24)-(26). 

Theorem 3.

(a) For all three models, $M(Y;\Theta \mid \rho)$,
$M(Y;Y_\nu \mid \rho)$ and $M[Y;(\Theta,Y_\nu) \mid \rho]$ increase
with $n$ and decrease with $\eta$.

(b) For both IC and SC models, $M(Y;\Theta \mid \rho)$
decreases with $\rho$, and

$$M^{IC}(Y;\Theta \mid \rho) \le M^{SC}(Y;\Theta \mid \rho) \le M^{UC}(Y;\Theta),$$

where the last equality holds if and only if $\rho = 0$.

(c) For both IC and SC models, $M(Y;Y_\nu \mid \rho)$
increases with $\rho$, and

$$M^{IC}(Y;Y_\nu \mid \rho) \ge M^{SC}(Y;Y_\nu \mid \rho) \ge M^{UC}(Y;Y_\nu),$$

where the last equality holds if and only if $\rho = 0$.

(d) For both IC and SC models, $M[Y;(\Theta,Y_\nu) \mid \rho]$
decreases in $\rho$ for $\rho < \rho_0(n,\eta)$ and increases in
$\rho$ for $\rho > \rho_0(n,\eta)$, where $\rho_0^{IC}(n,\eta)$ and
$\rho_0^{SC}(n,\eta)$ are roots of quadratic equations and both are
increasing in $n$ and decreasing in $\eta$.

Proof. (a) Can be easily seen by taking derivatives.
(b) It is also easy to see that for the correlated
models $T_n(R)$ are decreasing functions of $\rho$ and that
$T_n^{IC}(R) \le T_n^{SC}(R) \le T_n^{UC}(R) = n$. (c) This is
implied by the facts that $\rho_p \ge \rho$ and the predictive
information increases with $\rho$, as expected. (d) Taking
the derivative, $\rho_0(n,\eta)$ is given by the root of
$A_{n,\eta}\rho^2 + B_{n,\eta}\rho + C_{n,\eta} = 0$, where
$A^{IC}_{n,\eta} = n - 1$, $B^{IC}_{n,\eta} = 2(1 + n\eta^{-1})$,
$A^{SC}_{n,\eta} = 1 + (2n-1)\eta^{-1}$,
$B^{SC}_{n,\eta} = 1 + (2n-1)\eta^{-1}$ and
$C_{n,\eta} = (1-n)\eta^{-1}$. For each model there
is only a unique positive solution. □

Theorem 3 formalizes the intuition that samples
with stronger dependence are less informative about
the parameter and more informative about prediction.
Since $M(Y;\Theta \mid \rho)$ is increasing in $n$, one can
compensate for the loss of parameter information due to
the dependence by increasing the sample size. The
following example illustrates these and some other
noteworthy points.
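
The quantities (24), (26) and (27) are simple to evaluate for the three correlation structures. The sketch below is an illustration under stated assumptions: it uses the standard closed forms of $z'R^{-1}z$ and of the squared conditional multiple correlation for equicorrelated and first-order autoregressive correlation matrices, taken here to correspond to the entries of Table 1, and it checks the turning point of the IC joint information against the IC quadratic given in the proof of Theorem 3. All function names are hypothetical.

```python
import math

def t_n(model, n, rho):
    """T_n(R) = z'R^{-1}z for the UC, IC and SC correlation structures."""
    if model == "UC":
        return float(n)
    if model == "IC":
        return n / (1.0 + (n - 1) * rho)
    if model == "SC":
        return (n - (n - 2) * rho) / (1.0 + rho)
    raise ValueError(f"unknown model: {model}")

def parameter_information(model, n, rho, eta):
    """Equation (24)."""
    return 0.5 * math.log(1.0 + t_n(model, n, rho) / eta)

def conditional_dependence(model, n, rho):
    """Equation (27), with assumed closed forms for the squared multiple correlation."""
    if model == "UC":
        r2 = 0.0
    elif model == "IC":
        r2 = n * rho ** 2 / (1.0 + (n - 1) * rho)
    elif model == "SC":
        r2 = rho ** 2          # one-step prediction in a Markov sequence
    else:
        raise ValueError(f"unknown model: {model}")
    return -0.5 * math.log(1.0 - r2)

def joint_information(model, n, rho, eta):
    """Equation (26): parameter information plus conditional dependence."""
    return parameter_information(model, n, rho, eta) + conditional_dependence(model, n, rho)

def ic_turning_point(n, eta):
    """Positive root of (n-1) rho^2 + 2(1 + n/eta) rho + (1-n)/eta = 0."""
    a, b, c = n - 1.0, 2.0 * (1.0 + n / eta), (1.0 - n) / eta
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

n, eta = 5, 0.5
grid = [i / 1000.0 for i in range(1, 1000)]
rho_grid = min(grid, key=lambda rho: joint_information("IC", n, rho, eta))
print("IC minimizer on grid:", rho_grid, " root of quadratic:", ic_turning_point(n, eta))
for rho in (0.25, 0.5, 0.75):
    print(rho, [round(parameter_information(m, n, rho, eta), 4) for m in ("IC", "SC", "UC")])
```

A grid search is used for the minimizer so that the check does not depend on the quadratic it is compared with.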

Example 4.

(a) Figure 5 shows plots of $M(Y;\Theta \mid \rho)$ and
$M(Y;Y_\nu \mid \rho)$ against sample size for the UC model and the
correlated models IC and SC with $\rho = 0.50, 0.75$.
Plots in panels (a) and (b) reveal the following features.

(i) All information measures are increasing in $n$.

(ii) For the UC model, the parameter information
is the highest and has the fastest rate of increase
with $n$, and the predictive information is the lowest
with the slowest (almost flat) rate of increase.

(iii) For the SC model, the parameter information
is higher and increases much faster than the predictive
information.

(iv) For the IC model, the parameter information
is lower than the predictive information while both
measures have about the same rates of increase.

(v) Interestingly, for the UC and SC models, the
differences between the parameter and predictive
information measures grow with $n$ much faster than
the predictive information measures. That is, the
share of predictive information decreases with the
sample size.

(vi) As can be seen in Figure 5(a), to gain about
one unit (nit) of information, we need $n = 3$ from
the UC, and with $\rho = 0.50, 0.75$, we need $n = 8, 16$
observations under SC, and $n = 26, 37$ observations
under IC models, respectively.

(b) Figure 6(a) shows the plots of the joint
information measures for the SC and IC models as
functions of $\rho^2$ for $n = 5, 10$ and $\eta = 0.5$. Note that
the joint information of the SC model dominates
the joint information of the IC model when dependence
is weak. After the minimum point, the rate of
growth of joint information for the IC model is steep
and the IC information measure dominates the SC
information measure when the dependence is rather
strong.

(c) Figure 6(b) shows the plots of the minimum
joint information measures for the SC and IC families
as functions of $n$ for $\eta = 0.25, 0.5, 0.75$. These
plots are useful for determining sample size for each
family such that the minimum information exceeds
a given value. For example, to gain about 1.5 units
(nits) of information from an SC sample with unknown
$\rho$, we need $n = 9, 25, 37$ with $\eta = 0.25, 0.50, 0.75$,
respectively. The plots show that

$$M^{IC}[Y;(\Theta,Y_\nu) \mid n,\eta] \le M^{SC}[Y;(\Theta,Y_\nu) \mid n,\eta],$$

where $M[Y;(\Theta,Y_\nu) \mid n,\eta] = \min_\rho M[Y;(\Theta,Y_\nu) \mid \rho]$.
This inequality can be proved by substituting
$\rho_0^{IC}(n,\eta)$ and $\rho_0^{SC}(n,\eta)$ in the expressions
for $T_n(R)$ and $\rho^2_{y_\nu,y|\theta}$.

Fig. 5. The parameter information $M(Y;\Theta \mid \rho)$ [panel (a)]
and predictive information $M(Y;Y_\nu \mid \rho)$ [panel (b)] for the
independent, IC and SC normal models as functions of the sample
size ($\eta = 0.5$).

Fig. 6. The joint parameter and predictive information
$M[Y;(\Theta,Y_\nu) \mid \rho]$ [panel (a)] and minima of the joint
information $\min_\rho M[Y;(\Theta,Y_\nu) \mid \rho]$ [panel (b)] for
SC and IC normal models.

5.2 Order Statistics 

Let $Y_1 \le Y_2 \le \cdots \le Y_n$ be the order statistics of a
conditionally independent sample $X_1,\ldots,X_n$ from a
continuous distribution with density function $g(x \mid \theta)$,
and let $\mathbf{y}_r = (y_1,\ldots,y_r)$, $r \le n$. Conditional on
$\theta$, the order statistics have a Markovian dependence
structure (Arnold, 1992). The mutual information
between consecutive order statistics is given by

$$
\begin{aligned}
M(Y_r;Y_{r+1} \mid \theta) &= M_n(r)\\
&= \log B(r+1,n-r+1) + \log(n+1) - 1\\
&\quad - r\{\psi(r) - \psi(n)\} - (n-r)\{\psi(n-r) - \psi(n)\};
\end{aligned}
\tag{28}
$$

see the article by Ebrahimi, Soofi and Zahedi (2004).
That is, $M_n(r)$ is the measure of Markovian dependence
between order statistics of the independent
sample conditional on $\theta$. It was shown by Ebrahimi,
Soofi and Zahedi (2004) that $M_n(r)$ is increasing in
$n$, and for a given $n$, the information is symmetric in
$r$ and $n - r$, and attains its maximum at the median
(see Figure 7). The next lemma gives generalizations
of (28). All information functions are conditional on
$r$ and $n$, which will be suppressed when unnecessary.

Lemma 1. Let $Y_1 \le \cdots \le Y_n$ denote the order
statistics of random variables $X_1,\ldots,X_n$ which, given
$\theta$, are independent and have identical distribution
$g(x \mid \theta)$, and let $\mathbf{Y}_r$ and $\mathbf{Y}_q$ denote
disjoint subvectors of order statistics. Then:

(a) $M(\mathbf{Y}_r;\mathbf{Y}_q \mid \theta)$ is free from the
parent distribution $g(x \mid \theta)$ and the prior distribution
$f(\theta)$.

(b) For any two consecutive subvectors
$\mathbf{Y}_r = (Y_{k+1},\ldots,Y_{k+r})$ and
$\mathbf{Y}_q = (Y_{k+r+1},\ldots,Y_{k+r+q})$,
$M(\mathbf{Y}_r;\mathbf{Y}_q \mid \theta) = M_n(k+r)$.

Proof. Let $U = G(X)$. Then $U$ is uniform and
its order statistics $W_1 \le W_2 \le \cdots \le W_n$ are given
by $W_i = G(Y_i)$, and $\mathbf{W}_r$ and $\mathbf{W}_q$ are the
subvectors corresponding to $\mathbf{Y}_r$ and $\mathbf{Y}_q$. Since
$W_i = G(Y_i)$ is one-to-one, we have
$M(\mathbf{Y}_r;\mathbf{Y}_q) = M(\mathbf{W}_r;\mathbf{W}_q)$.
Furthermore, the distribution of any subset of order
statistics is ordered Dirichlet with parameters $n$
and the indices of the order statistics contained in
the subset, hence
$M(\mathbf{Y}_r;\mathbf{Y}_q) = M(\mathbf{W}_r;\mathbf{W}_q)$ is free
from the parent distribution $g(x \mid \theta)$. Part (b) follows
from $Y_1 \mid \theta,\ldots,Y_n \mid \theta$ being a Markovian
sequence. □
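
Because the dependence measure in (28) does not involve the parent distribution or the prior, it can be tabulated once for each $n$. The following sketch evaluates (28) with the standard library only; the digamma helper is an assumption of this illustration (recurrence plus asymptotic series), and the function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def markov_dependence(n, r):
    """M_n(r) in (28) for 1 <= r <= n - 1."""
    if not 1 <= r <= n - 1:
        raise ValueError("r must satisfy 1 <= r <= n - 1")
    # log B(r + 1, n - r + 1) via log-gamma
    log_beta = math.lgamma(r + 1) + math.lgamma(n - r + 1) - math.lgamma(n + 2)
    return (log_beta + math.log(n + 1) - 1.0
            - r * (digamma(r) - digamma(n))
            - (n - r) * (digamma(n - r) - digamma(n)))

n = 26
values = {r: markov_dependence(n, r) for r in range(1, n)}
r_max = max(values, key=values.get)
print("r with the largest dependence:", r_max)
for r in (1, 5, 13, 21, 25):
    print(r, round(values[r], 4))
```

The symmetry in $r$ and $n - r$ and the peak at the median, both noted in the text, can be read off the tabulated values.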

Fig. 7. Expected information about the parameter
$M(\mathbf{Y}_r;\Theta)$ and the joint information about the
parameter and prediction of the $(r+1)$st order statistic
$M[\mathbf{Y}_r;(\Theta,Y_{r+1})]$ provided by the vector of
preceding order statistics $\mathbf{Y}_r$, and the information due
to the Markovian dependence between order statistics ($n = 26$).

It can easily be shown that the information provided
by the first $r$ order statistics about the parameter,
$M(\mathbf{Y}_r;\Theta)$, satisfies (5). The predictive distributions
of order statistics are given by
$f(y_i) = \int f(y_i \mid \theta)f(\theta)\,d\theta$,
$i = 1,\ldots,n$. Note that $y_1 \le y_2 \le \cdots \le y_n$ are
the order statistics of a sample of the exchangeable
sequence $X_1,\ldots,X_n$, unconditionally. The following
results provide some insight about the parameter
and predictive information for order statistics.

Theorem 4. Let $M[\mathbf{Y}_r;(\Theta,Y_{r+1})]$ denote the
information provided by the first $r$ order statistics about
the parameter and for prediction of the next order
statistic jointly. Then:

(a) $M[\mathbf{Y}_r;(\Theta,Y_{r+1})] = M(\mathbf{Y}_r;\Theta) + M_n(r) \ge M(\mathbf{Y}_r;\Theta)$.

(b) The following statements are equivalent:

(i) $M(\mathbf{Y}_r;Y_{r+1}) \ge (\le)\ M_n(r)$.

(ii) $M(\Theta;Y_{r+1}) \ge (\le)\ M(\mathbf{Y}_{r+1};\Theta) - M(\mathbf{Y}_r;\Theta)$,
where $\mathbf{Y}_{r+1} = (Y_1,\ldots,Y_r,Y_{r+1})$.
Proof. Using the following decomposition of
mutual information, we have

$$M[(\Theta,Y_{r+1});\mathbf{Y}_r] = M(\mathbf{Y}_r;\Theta) + M(\mathbf{Y}_r;Y_{r+1} \mid \Theta).$$

Applying part (b) of Lemma 1 to the second term
gives the result (a). For (b) we use the following
decompositions of mutual information:

$$
\begin{aligned}
M[(\mathbf{Y}_r,\Theta);Y_{r+1}] &= M(\mathbf{Y}_r;Y_{r+1}) + M(\Theta;Y_{r+1} \mid \mathbf{Y}_r)\\
&= M(\Theta;Y_{r+1}) + M(\mathbf{Y}_r;Y_{r+1} \mid \Theta).
\end{aligned}
$$

Equating the two decompositions with
$M(Y_{r+1};\mathbf{Y}_r \mid \Theta) = M_n(r)$ gives equivalence of
(i) and

$$M(\Theta;Y_{r+1}) \ge (\le)\ M(\Theta;Y_{r+1} \mid \mathbf{Y}_r). \tag{29}$$

The equivalence with (ii) is obtained by solving

$$M(\mathbf{Y}_{r+1};\Theta) = M(\mathbf{Y}_r;\Theta) + M(\Theta;Y_{r+1} \mid \mathbf{Y}_r)$$

for $M(\Theta;Y_{r+1} \mid \mathbf{Y}_r)$ and substituting in (29). □

Part (a) of Theorem 4 shows $M[\mathbf{Y}_r;(\Theta,Y_{r+1})]$ is
inclusive of Lindley's measure, reflecting the fact that
conditional on $\theta$, order statistics are dependent. So
the information provided by the first $r$ order statistics
about the parameter and for prediction of the
next order statistic is more than the information
provided about the parameter. However, the excess
information amount measures the Markovian dependence
between order statistics of the independent
sample and does not depend on $g_{x|\theta}$ and $f_\theta$. An
implication of this result is that the reference posterior
corresponding to the prior that maximizes the parameter
information $M(\mathbf{Y}_r;\Theta)$ also remains optimal
with respect to $M[\mathbf{Y}_r;(\Theta,Y_{r+1})]$.

Part (b) of Theorem 4 gives the equivalence of 
the orders of information in terms of (i) the predic- 
tive and sample order statistics and (ii) the expected 
information about the parameter provided by an or- 
der statistic in terms of the incremental amount of 
information provided about the parameter. 

Example 5. For the case of the exponential model
with the gamma prior, the conditional distribution
of the $(r+1)$st order statistic given $\theta$ and the first $r$
order statistics is exponential with density

$$f(y_{r+1} \mid y_r,\theta) = (n-r)\theta e^{-\theta(n-r)(y_{r+1}-y_r)},\quad y_{r+1} \ge y_r. \tag{30}$$

The posterior predictive distribution of the $(r+1)$st order
statistic given the first $r$ order statistics is Pareto
with parameters $\alpha + r$, $b_r = \frac{\beta + t_r}{n - r}$ and a
location parameter $y_r$. Since entropy is location-invariant,
$H(Y_{r+1} \mid \mathbf{y}_r)$ is
$H(Y_{r+1} \mid t_r,y_r,r,n) = H(Y_\nu \mid t_r,r) - \log(n-r)$.
Figure 7 illustrates some properties of
these information measures for the exponential model
and $n = 26$.

(a) Figure 7(a) shows plots of
$M(\mathbf{Y}_r;\Theta \mid \alpha,n) = M(T_r;\Theta \mid \alpha,n)$ for
$\alpha = 0.5, 1, 2, 4$, superimposed by
the Markovian dependence information measure $M_n(r)$
for the order statistics. Since $M(T_r;\Theta \mid \alpha,n)$ is
increasing in $r$, censoring results in loss of information
about the parameter. Thus, without consideration of the
cost of the experiment, $r^* = n = 26$. Since $M_n(r)$ is
decreasing for $r$ larger than the median, censoring
beyond the median results in gain of information
about the next outcome.

(b) Figure 7(b) shows the plots of the parameter
information $M(\mathbf{Y}_r;\Theta \mid \alpha,n)$ and joint information
$M[\mathbf{Y}_r;(\Theta,Y_{r+1}) \mid \alpha,r,n]$ computed using part (a)
of Theorem 4 for $\alpha = 0.5, 1$. We note that
$M[\mathbf{Y}_r;(\Theta,Y_{r+1}) \mid \alpha,r,n]$ is not monotone because
the Markovian dependence information measure $M_n(r)$
decreases for the order statistics above the median.
The optimal $r$ for the joint parameter and predictive
information, without consideration of the cost of the
experiment, is $r^* = 17 < n$. Thus, unlike the case of the
conditionally independent model, the parameter
information utility and the joint parameter-predictive
information utility lead to different sampling plans.
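
The sampling plan in part (b) can be explored by adding the parameter information (19), evaluated at $n = r$, to the dependence measure (28), as part (a) of Theorem 4 prescribes. The sketch below assumes exactly that combination; the digamma helper and the function names are hypothetical and serve only this illustration.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def gamma_entropy(a):
    """Entropy of the gamma distribution with shape a and unit rate."""
    return math.lgamma(a) - (a - 1.0) * digamma(a) + a

def parameter_information(r, a):
    """Equation (19) with r observed failures and prior shape a."""
    return gamma_entropy(a) - gamma_entropy(a + r) + digamma(a + r) - digamma(a)

def markov_dependence(n, r):
    """M_n(r) in (28) for 1 <= r <= n - 1."""
    log_beta = math.lgamma(r + 1) + math.lgamma(n - r + 1) - math.lgamma(n + 2)
    return (log_beta + math.log(n + 1) - 1.0
            - r * (digamma(r) - digamma(n))
            - (n - r) * (digamma(n - r) - digamma(n)))

def joint_information(n, r, a):
    """Theorem 4(a): parameter information plus Markovian dependence."""
    return parameter_information(r, a) + markov_dependence(n, r)

n = 26
best = {}
for a in (0.5, 1.0):
    # r ranges over 1..n-1 so that an (r + 1)st order statistic exists
    best[a] = max(range(1, n), key=lambda r: joint_information(n, r, a))
    print(f"alpha={a}: r maximizing the joint information = {best[a]}")
```

The maximizer found this way can be compared with the value $r^* = 17$ reported in the example; the parameter information alone keeps increasing in $r$.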

In Section 4 we noted that under the Jeffreys prior,
at least one observation is needed for obtaining a
proper posterior. Following this idea more generally,
we compare the expected uncertainty change due to
the first $r$ order statistics with the first order statistic
$r = 1$, given by

$$B[\mathbf{Y}_r;(\Theta,Y_{r+1})] = H[(\Theta,Y_1)] - \mathcal{H}[(\Theta,Y_{r+1}) \mid \mathbf{Y}_r],$$

where
$\mathcal{H}[(\Theta,Y_{r+1}) \mid \mathbf{Y}_r] = E_{\mathbf{y}_r}\{H[(\Theta,Y_{r+1}) \mid \mathbf{y}_r]\}$
is the conditional joint entropy of $(\Theta,Y_{r+1})$ given the first
$r$ order statistics, averaged with respect to $f(\mathbf{y}_r)$. The
expected uncertainty change $B(\mathbf{Y}_r;Y_{r+1})$ for prediction
of the $(r+1)$st order statistic is defined similarly.
These measures, which can be referred to as the
information bridge between the first and $(r+1)$st order
statistics, are invariant under linear transformations,
but can be negative. It can be shown that for
any parent distribution $g(x \mid \theta)$ where $\theta$ is the scale
parameter and any prior $f(\theta)$,

$$
\begin{aligned}
M(\mathbf{Y}_r;\Theta) &= B[\mathbf{Y}_r;(\Theta,Y_{r+1})] + \log\left(\frac{n}{n-r}\right),\\
M(\mathbf{Y}_r;Y_{r+1}) &= B(\mathbf{Y}_r;Y_{r+1}) + \log\left(\frac{n}{n-r}\right).
\end{aligned}
$$

Clearly, $B(\cdot,\cdot \mid r,n) \to M(\cdot,\cdot \mid r,n)$ as
$r/n \to 0$. So, the quantity $\log\left(\frac{n}{n-r}\right)$ can be
interpreted as the finite sample correction factor for the
information.

6. CONCLUSIONS 

This article is the first attempt to study the re- 
lationship between the parameter and predictive in- 
formation measures, the analytical behavior of the 
predictive information in terms of prior parameters 
and the effects of conditional dependence between 
the observable quantities on the Bayesian informa- 
tion measures. We provided analytical results and 
showed applications in some statistical and model- 
ing problems. 

The measure of information that the sample provides
about the parameter and prediction jointly led to 
some new insights about the marginal parameter 
and predictive information measures. For the case of 
conditionally independent observations, decomposi- 
tions of the joint information revealed that the pa- 
rameter information is in fact the measure of infor- 
mation about the parameter and prediction jointly. 
This finding implies that all existing results about 
Lindley's information are applicable to the joint mea- 
sure of parameter and predictive information. In 
particular, the reference posterior and the optimal 
design that maximize the sample information about 
the parameter are also optimal solutions for the sam- 
ple information about the parameter and prediction 
jointly. Yet another information decomposition revealed
that predictive information is a part of the
information that the sample provides about the parameter.

We examined interplay between the information 
measures and the prior and design parameters for 
two general classes of models: the linear models for 
the normal mean, and a broad subfamily of the ex- 
ponential family. A few applications showed the use- 
fulness of the information measures and some in- 
sights were developed. A proposition provided the 
optimal designs with respect to the parameter (joint) 
information and predictive information measures for 
an ANOVA-type model. The results include the minimum 
sample sizes required in terms of the given 
prior variances and the covariate vector for the pre- 
diction. Another proposition provided the optimal 
prior variance allocation scheme with respect to the 
parameter (joint) information for collinear regres- 
sion, which includes the minimum prior variance 
required for the problem. Examples for the linear 
and the exponential family models revealed that the 
predictive information provided by the conditionally 
independent sample is only a small fraction of the 
parameter (joint) information and the gap between 
the parameter and predictive information measures 
grows rapidly with the sample size. This finding in- 
dicates that despite the importance of prediction in 
the Bayesian paradigm, the parameter takes the ma- 
jor share of the information provided by condition- 
ally independent samples. An example examined the 
parameter information when the parameter of in- 
terest is the vector of means of two treatments and 
the predictive information of interest is the weighted 
average (or contrast) between outcomes of the two 
treatments. This example revealed that the loss of 
information about the parameter under the opti- 
mal design for predictive information is much higher 
than the loss of predictive information under the 
optimal design for the parameter information. The 
parameter is the major shareholder of the sample 
information so its loss is more severe than the loss 
of predictive information under suboptimal designs. 

We have examined, for the first time, the role of 
conditional dependence between observable quanti- 
ties on the sample information about the parameter 
and prediction. For a dependent sequence, the joint 
parameter and predictive information decomposes 
into the parameter information (Lindley's measure) 
and an information measure mapping the conditional 
dependence. We provided more specific results for 
correlated variables whose distributions can be trans- 
formed to normal and for the order statistics with- 
out any distributional assumption. For the normal 
sample, we compared the information measures for 
the independent, the intraclass correlation and serial 
correlation models. We showed that the parameter 
information decreases and predictive information in- 
creases with the correlation. However, the joint in- 
formation decreases in the correlation to a minimum 
point, which is determined by the prior precision 
and sample size, and then increases. For condition- 
ally dependent sequences, the dominance of parame- 
ter information that was noted for the conditionally 
independent samples does not hold. Since all infor- 
mation measures increase with the sample size, loss 
of parameter information due to dependence can be 
offset by taking larger samples. 

Order statistics also provided a context for in- 
formation analysis of conditionally Markovian se- 
quences. Extension of a result on information prop- 
erties of order statistics was needed to show that the 
Markovian dependence measure depends neither on 
the model for the data, nor on the prior distribution 
for the parameter. By this finding, the reference pos- 
terior that maximizes the sample information about 
the parameter retains its optimality according to the 
joint parameter and predictive information measure 
of the order statistics. An example illustrated the implication 
in terms of the optimal number of failures to 
be observed under Type II censoring. 

APPENDIX 

A.1 Classification of Literature 

Table 2 gives a classification of literature on the 
Bayesian applications of mutual information. Sev- 
eral authors have used information in various Bayesian 
contexts, which are not listed in Table 2; exam- 
ples include Aitchison (1975), Zellner (1977, 1988), 
Geisser (1993), Keyes and Levy (1996), Ibrahim and 
Chen (2000), Brown, George and Xu (2008). Nicolae, 
Meng and Kong (2008) defined some measures 
of the fraction of missing information and pointed 
out the connection between their measures and entropy, 
stating that "essentially all measures we presented 
have entropy flavor." Measures of information 
for nonparametric Bayesian data analysis are 
also available (Müller and Quintana, 2004). Since 
our focus is on the mutual information, for exam- 
ple, Lindley's measure and its predictive version, we 
did not discuss other information measures. 



A.2 Proof of Proposition 1 

(a) Noting that $\lambda_j = n_j$, $j = 1, \ldots, p$, and letting 
$n_1 = n - \sum_{j=2}^{p} n_j$ in (11) gives the first-order conditions 

$$\frac{\partial M(Y; \Theta \mid Z, \eta, V_0)}{\partial n_j} = \frac{1}{2}\left[\frac{v_{0j}}{\eta + v_{0j} n_j} - \frac{v_{01}}{\eta + v_{01} n_1}\right] = 0, \qquad j = 2, \ldots, p.$$

Solutions to this system give $n_j^*$, $j = 2, \ldots, p$, in (13) 
and $n_1^*$ is found from $n_1 = n - \sum_{j=2}^{p} n_j^*$. It can be 
verified by the second-order conditions that the solutions 
give the maximum. 

(b) Using $V_1 = (\eta V_0^{-1} + Z'Z)^{-1}$ in (12) gives 

$$M(Y; Y_\nu \mid z_\nu, Z, \eta, V_0) = \frac{1}{2}\log\left(\eta^{-1} z_\nu' V_0 z_\nu + 1\right) - \frac{1}{2}\log\left(z_\nu'(\eta V_0^{-1} + Z'Z)^{-1} z_\nu + 1\right).$$

The first term does not depend on the design, so it 
is sufficient to minimize 

$$h(n_1, \ldots, n_p) = z_\nu'(\eta V_0^{-1} + Z'Z)^{-1} z_\nu = \sum_{j=1}^{p} \frac{v_{0j} z_j^2}{\eta + v_{0j} n_j}$$

subject to the constraint $\sum_{j=1}^{p} n_j = n$. Letting 
$n_1 = n - \sum_{j=2}^{p} n_j$ gives the first-order conditions 

$$\frac{\partial h(n_1, \ldots, n_p)}{\partial n_j} = -\frac{v_{0j}^2 z_j^2}{(\eta + v_{0j} n_j)^2} + \frac{v_{01}^2 z_1^2}{(\eta + v_{01} n_1)^2} = 0, \qquad j = 2, \ldots, p.$$

Solutions to this system give $n_j^*$, $j = 2, \ldots, p$, in (14) 
and $n_1^*$ is found from $n_1 = n - \sum_{j=2}^{p} n_j^*$. It can be 
verified by the second-order conditions that the solutions 
give the maximum. 
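
The first-order conditions of part (a) can be checked numerically. The sketch below is illustrative only: it assumes that the parameter information in (11) reduces to (1/2) * sum_j log(1 + v_0j * n_j / eta) for this design, treats the n_j as continuous, and covers only the interior case where every n_j is nonnegative. The function names and the numerical inputs are hypothetical and are not taken from the article.

```python
import math


def parameter_information(n_alloc, v0, eta):
    """Assumed form of (11): 0.5 * sum_j log(1 + v0[j] * n_alloc[j] / eta)."""
    return 0.5 * sum(math.log(1.0 + v * n / eta) for n, v in zip(n_alloc, v0))


def interior_allocation(n_total, v0, eta):
    """Solve v0[j] / (eta + v0[j] * n_j) = constant subject to sum(n_j) = n_total.

    The condition is equivalent to eta / v0[j] + n_j being the same for all j.
    """
    p = len(v0)
    level = (n_total + eta * sum(1.0 / v for v in v0)) / p
    alloc = [level - eta / v for v in v0]
    if min(alloc) < 0:
        raise ValueError("n_total is too small for an interior solution")
    return alloc


def first_order_terms(n_alloc, v0, eta):
    """Terms v0[j] / (eta + v0[j] * n_j); they coincide at a stationary point."""
    return [v / (eta + v * n) for n, v in zip(n_alloc, v0)]


# Hypothetical prior variances, variance ratio and total sample size.
v0 = [1.0, 2.0, 4.0]
eta = 2.0
n_total = 30

n_star = interior_allocation(n_total, v0, eta)
print("allocation:", n_star)
print("first-order terms:", first_order_terms(n_star, v0, eta))
print("information at allocation:", parameter_information(n_star, v0, eta))
print("information at equal split:", parameter_information([10, 10, 10], v0, eta))
```

Under these assumptions the stationary allocation gives more observations to the components with larger prior variance, and the interior solution exists only when the total sample size is large enough, which is in line with the minimum sample size requirement mentioned for Proposition 1.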

A.3 Proof of Proposition 2 

The solutions are found similarly to part (a) of 
Proposition 1 by taking the derivative of (11) with 
respect to $v_{0j}$ subject to $\sum_{j=1}^{p} v_{0j} = c$. 

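A.4 Computation of Normal Predictive Correlation 

We compute the predictive correlation $\rho_p$ through 
the well-known formula for partial correlation: 

$$\rho_{ij \mid k} = \frac{\rho_{ij} - \rho_{ik}\rho_{jk}}{(1 - \rho_{ik}^2)^{1/2}(1 - \rho_{jk}^2)^{1/2}}. \tag{A.1}$$

In our case, $i, j, k$ represent $Y_i$, $Y_\nu$ and $\theta$, respectively. 
Note that 

$$\rho_{i\theta}^2 = 1 - \frac{\sigma^2_{\theta \mid y_i}}{\sigma^2_{\theta}} = \frac{1}{1 + \eta} \qquad \text{for all } i = 1, 2, \ldots.$$

Letting $\rho_{ik}^2 = \rho_{jk}^2 = \rho_{i\theta}^2$ in (A.1) gives the unconditional 
(predictive) correlation as 

$$\rho_{i\nu} = \rho_{i\theta}^2 + (1 - \rho_{i\theta}^2)\rho_{i\nu \mid \theta} = \frac{1 + \eta\rho_{i\nu \mid \theta}}{1 + \eta}.$$

Letting $\rho_{i\nu \mid \theta} = 0$, $\rho$, $\rho^{\nu - i}$, $\nu > i$, respectively for UC, 
IC and SC models, we obtain the entries of Table 1 
for the three models. 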
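
The step from the partial correlation formula (A.1) to the unconditional correlation can be verified numerically. The following sketch assumes only the relations stated in this appendix, namely that the squared correlation between each Y_i and theta equals 1 / (1 + eta) and that the conditional correlations are 0, rho and rho^(nu - i) for the UC, IC and SC models. The function names, and the values chosen for eta, rho and the lag nu - i, are hypothetical.

```python
import math


def partial_correlation(r_ij, r_ik, r_jk):
    """Partial correlation of i and j given k, as in (A.1)."""
    return (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))


def predictive_correlation(eta, rho_cond):
    """Unconditional correlation of Y_i and Y_nu given their conditional correlation."""
    return (1.0 + eta * rho_cond) / (1.0 + eta)


def model_correlations(eta, rho, lag):
    """Predictive correlations for the UC, IC and SC models at lag = nu - i > 0."""
    return {
        "UC": predictive_correlation(eta, 0.0),
        "IC": predictive_correlation(eta, rho),
        "SC": predictive_correlation(eta, rho ** lag),
    }


# Hypothetical inputs for illustration.
eta, rho, lag = 1.0, 0.5, 2
entries = model_correlations(eta, rho, lag)
print(entries)

# Round trip: plugging the unconditional correlation back into (A.1)
# should recover the conditional correlation that was assumed.
r_theta = math.sqrt(1.0 / (1.0 + eta))
recovered = partial_correlation(entries["IC"], r_theta, r_theta)
print("recovered conditional correlation:", recovered)
```

The round trip at the end inverts the computation: starting from the unconditional correlation and the common correlation with theta, (A.1) returns the conditional correlation that was supplied.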
Table 2 

Classification of articles on Lindley's measure of sample information about the parameter and its predictive version 

Parameter information 

  Likelihood model and design: 
    Lindley (1956, 1957, 1961), Stone (1959), El-Sayyed (1969), Brooks (1980, 1982), 
    Smith and Verdinelli (1980), Turrero (1989), Barlow and Hsiung (1983), 
    Soofi (1988, 1990), Ebrahimi and Soofi (1990), Carlin and Polson (1991), 
    Verdinelli and Kadane (1992), Polson (1992), Verdinelli (1992), 
    Parmigiani and Berry (1994), Chaloner and Verdinelli (1995), 
    Carota et al. (1996), Singpurwalla (1996), Yuan and Clarke (1999) 

  Prior and posterior distributions: 
    Bernardo (1979a, 1979b), Soofi (1988, 1990), Ebrahimi and Soofi (1990), 
    Bernardo and Rueda (2002), Bernardo (2005) 

Predictive information 

  Likelihood model and design: 
    San Martini and Spezzaferri (1984), Amaral and Dunsmore (1985), Verdinelli (1993), 
    Verdinelli et al. (1993), Chaloner and Verdinelli (1995), Singpurwalla (1996) 



distribution raised by Jie Feng in a doctoral seminar 
course on Bayesian Statistics at Sheldon B. Lubar 
School of Business. Ehsan Soofi's research was par- 
tially supported by a Sheldon B. Lubar School's 
Business Advisory Council Summer Research Fel- 